AI - Exam Prep - Summary Artificial Intelligence

Author: Taqi Tazwar
Course: Artificial Intelligence
Institution: Brunel University London


Clustering

Learning

- Unsupervised Learning: learning without the desired outputs (‘teacher’ signals).
- Supervised Learning: learning with the desired output.
- Reinforcement Learning: learning with reward/punishment signals.

Clustering is unsupervised, while classification is supervised.

Different clustering algorithms differ on:

- Convergence criteria (how to stop clustering)
- How data points are assigned to clusters
- Similarity measure (which data points are similar)

Clustering Process

- Extract features (colour, movement, sensory organs, etc.)
- Cluster into categories
- Consolidation

Clustering: to partition a dataset into subsets (clusters), so that the data in each subset share some common trait, often similarity or proximity under some defined distance metric.

Clustering

- Patterns: physical objects
- Clusters: categories of objects
- Features: attributes of objects

Pattern Similarity

- Similarity is the key concept in clustering: clusters are formed by similar patterns.
- A metric is needed to measure similarity.
- One of the most commonly adopted similarity metrics is distance.

Distances

The Euclidean and Manhattan distances between two patterns x and y are calculated as:

Euclidean Distance: d(x, y) = sqrt( Σi (xi - yi)^2 )

Manhattan Distance: d(x, y) = Σi |xi - yi|
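As a small worked illustration of the two metrics (a minimal sketch; the example vectors are illustrative, not taken from the notes' original table):

```python
# Euclidean and Manhattan distance between two feature vectors.
import math

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

print(euclidean((1, 2), (4, 6)))  # 5.0
print(manhattan((1, 2), (4, 6)))  # 7
```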

K-Means Clustering Algorithm

1. Place K points into the feature space. These points represent the initial cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the assignments do not change.
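A minimal NumPy sketch of the four steps above (the toy initialisation and convergence test are my own choices, not the lecture's code):

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: place K points into the feature space as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign each pattern to the closest cluster centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate the positions of the K centroids
        # (an empty cluster would yield NaN here; the notes list potential
        # empty clusters among the cons of K-Means)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: repeat until the assignments no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```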

Limitations of K-Means

- At each iteration of the K-Means clustering algorithm, a pattern can be assigned to one cluster only, i.e. the assignment is hard.
- If a pattern is in the middle of two cluster centroids, it has to be assigned to only one of them, even if intuitively it should contribute equally to both.
- K-Means cannot efficiently handle elongated clusters or clusters with different variances (e.g. circle clusters).

K-Means Pros & Cons

Pros

- Computationally faster than hierarchical clustering (if K is small)
- May produce tighter clusters than hierarchical clustering, especially if the clusters are globular

Cons

- The fixed number of clusters makes it hard to predict what K should be
- Different initial partitions can result in different final clusters
- Potential empty clusters (not always bad)
- Does not work well with non-globular clusters

Hierarchical Clustering

Hierarchical clustering produces a series of clustering results. The series starts with each object in its own cluster and ends with all of the objects in the same cluster. The resulting tree-like structure is called a dendrogram.

Algorithm

1. Each item is assigned its own cluster (n clusters of size one).
2. Let the distances between the clusters equal the distances between the objects they contain.
3. Find the closest pair of clusters and merge them into a single cluster (one less cluster).
4. Re-compute the distances between the new cluster and each of the old clusters.
5. Repeat steps 3 and 4 until there is only one cluster left.

Re-computing Distances

- Single: the smallest distance between any pair of points from the two clusters being compared (one point from each).
- Average: the average distance between pairs.
- Complete: the largest distance between any pair of points from the two clusters being compared (one point from each).
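As a runnable illustration, SciPy's agglomerative clustering exposes these three distance updates through its `method` argument (the toy data is mine, not from the notes):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(10, 2))  # toy data

Z = linkage(X, method='single')     # single: smallest pairwise distance
# Z = linkage(X, method='average')  # average: mean pairwise distance
# Z = linkage(X, method='complete') # complete: largest pairwise distance

dendrogram(Z)  # draws the tree-like structure (requires matplotlib)
```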

Hierarchical Clustering Pros & Cons

Pros

- Can produce an ordering of the objects, which may be informative for data display
- Smaller clusters are generated, which may be helpful for discovery

Cons

- No provision can be made for relocating objects that may have been ‘incorrectly’ grouped at an early stage
- Using different distance metrics for measuring distances between clusters may generate different results

Fuzzy K-Means

Fuzzy K-Means can deal with overlapping clusters by assigning each data point a weight per cluster rather than a single cluster allocation. However, deciding how to allocate clusters based on these weights is not always easy.

Classification

Classification attempts to solve the following problem:

- given some data,
- where each case in the data has been allocated a class,
- assign a class to a new, unassigned case.

Decision Trees

- Nodes represent decisions
- Arcs represent possible answers
- Terminal nodes represent classifications

Classifying with a Decision Tree

1. Traverse the tree starting at the root node.
2. At each decision node, follow the appropriate branch according to the case you are classifying.
3. Repeat until you reach a leaf node.
4. Classify according to the label at the leaf node.

Building a Decision Tree

- Often known as rule induction.
- Nodes are repeatedly split until all elements represented belong to one class.
- Nodes then become terminal nodes.
- Deciding which node to split next, as well as the evaluation function used to split it, depends on the algorithm.

Pruning Decision Trees

- Simpler models are preferred; this preference is known as “Occam’s Razor”.
- Pruning prevents overfitting.
- Tree pruning can be performed as follows (see the sketch below):
  - Go through each decision node.
  - Consider converting it into a leaf node.
  - If this does not reduce classification accuracy, the pruning is carried out.
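The notes describe reduced-error pruning; as a related, runnable illustration (a different pruning criterion, not the lecture's procedure), scikit-learn offers cost-complexity pruning through the `ccp_alpha` parameter:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
# The pruned tree is the simpler model that "Occam's Razor" prefers.
print(full.tree_.node_count, pruned.tree_.node_count)
```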

K-Nearest Neighbour (KNN)

- A form of case-based reasoning.
- Based on the idea that items located “nearby” in the data space influence how unknown data is classified.
- Requires:
  - a distance metric
  - the k parameter (number of neighbours)
  - a weighting function
  - a way to combine information from the neighbours
- Distance metrics:
  - Euclidean
  - Correlation
  - Mahalanobis
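A from-scratch sketch of these ingredients, assuming a Euclidean metric and an unweighted majority vote (the function name and toy data are mine):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)    # distance metric (Euclidean)
    nearest = np.argsort(dists)[:k]                # the k nearest neighbours
    votes = Counter(y_train[i] for i in nearest)   # unweighted majority vote
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array(['a', 'a', 'b', 'b'])
print(knn_predict(X_train, y_train, np.array([4.8, 5.2])))  # 'b'
```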

Testing Performance

- The aim is to classify new, unseen data.
- Error rate = number of errors / number of cases.
- The empirical error rate is not the same as the true error rate: the empirical rate is based on a sample, while the true rate is based on infinitely many cases.

Training and Test Sets

- Known as the holdout method.
- Use the training set to learn the model.
- Use the test set to score accuracy.
- Ideally the two sets are independent (generated separately).
- We want to estimate the true error.
- If the test set is independent, then only a small number of cases is needed to approximate the true error rate.

Cross Validation

- K-fold Cross Validation (see the sketch below):
  - Randomly split the data into k subsets of equal size.
  - Remove one of the subsets and train the classifier on the remaining subsets.
  - Test on the removed subset.
  - Repeat for all subsets.
- For extremely small samples, use Leave-One-Out Cross Validation, where k equals the number of cases (each test set contains a single case).
- Computationally expensive.
- Cross validation is considered an unbiased estimator of the true error.
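A minimal sketch of the k-fold procedure (the `fit`/`predict` callables are placeholders for any classifier, not part of the notes):

```python
import numpy as np

def k_fold_error(X, y, fit, predict, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)        # k subsets of (nearly) equal size
    errors = []
    for i in range(k):
        test = folds[i]                   # the removed subset
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train], y[train])   # train on the remaining subsets
        pred = predict(model, X[test])    # test on the removed subset
        errors.append(np.mean(pred != y[test]))
    return float(np.mean(errors))         # averaged estimate of the true error
```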

Sensitivity & Specificity

- Sensitivity: TP / C+
- Specificity: TN / C-
- Accuracy: (TP + TN) / (C+ + C-)

For example, a classifier can have high sensitivity if it successfully classifies people who have developed cancer, but low specificity if it also classifies non-sufferers as having cancer.
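The three formulas translate directly to code (the counts below are made up for illustration, not from the notes):

```python
def sensitivity(tp, c_pos):   # TP / C+
    return tp / c_pos

def specificity(tn, c_neg):   # TN / C-
    return tn / c_neg

def accuracy(tp, tn, c_pos, c_neg):   # (TP + TN) / (C+ + C-)
    return (tp + tn) / (c_pos + c_neg)

# The cancer example above: most sufferers caught (high sensitivity)
# but many non-sufferers flagged as well (low specificity).
print(sensitivity(45, 50), specificity(30, 100))  # 0.9 0.3
```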

Bias & Variance

- Bias: some systematic error in the model.
- Variance: the difference from one model to the next.
- Overly complex models are more likely to overfit.

Images & Vision

Edge Detection

- Edge: an area of significant change in the image intensity/contrast.
- Edge Detection: locating areas with strong intensity contrasts.
- Uses of Edge Detection: extracting information about the image, e.g. the location of objects present in the image, their shape and size; image sharpening and enhancement.

Types of Edges

Variation of intensity/grey level:

- Step edge
- Line edge
- Roof edge

Steps in Edge Detection

- Filtering: filter the image to improve the performance of the edge detector with respect to noise.
- Enhancement: emphasise pixels having significant change in local intensity.
- Detection: identify edges, e.g. by thresholding.
- Localisation: locate the edges accurately and estimate edge orientation.

Roberts Operator

- Spurious dots indicate that the operator is susceptible to noise.

Sobel Operator

- Less susceptible to noise
- Produces thicker edges
- Edge localisation is poor

Sobel vs Roberts

- Spurious edges are still present, but they are relatively less intense compared to genuine lines.
- The Roberts operator misses a few edges.
- The Sobel operator detects thicker edges.
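A sketch applying both operators as explicit kernels via 2-D convolution (`img` is assumed to be a 2-D grayscale array; the kernels are the standard Roberts/Sobel masks):

```python
import numpy as np
from scipy.signal import convolve2d

roberts_x = np.array([[1, 0], [0, -1]])
roberts_y = np.array([[0, 1], [-1, 0]])
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
sobel_y = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]])

def edge_strength(img, kx, ky):
    gx = convolve2d(img, kx, mode='same')  # horizontal gradient estimate
    gy = convolve2d(img, ky, mode='same')  # vertical gradient estimate
    return np.hypot(gx, gy)               # gradient magnitude

# The 3x3 Sobel masks average over more pixels than the 2x2 Roberts masks,
# consistent with the noise behaviour and thicker edges noted above.
```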

Canny Edge Detector

1. Noise is filtered out; usually a Gaussian filter is used. The width is chosen carefully.
2. The edge strength is found by taking the gradient of the image. A Roberts mask or a Sobel mask can be used.
3. Find the edge direction.
4. Resolve the edge direction.
5. Non-maxima suppression: trace along the edge direction and suppress any pixel value not considered to be an edge. This gives a thin line for the edge.
6. Use double / hysteresis thresholding to eliminate streaking.
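The six steps are bundled inside library implementations; a minimal illustration with scikit-image (not the lecture's own code):

```python
from skimage import data
from skimage.feature import canny

img = data.camera()            # sample grayscale image shipped with scikit-image
# sigma sets the width of the step-1 Gaussian; the low_threshold and
# high_threshold parameters (left at their defaults here) are the step-6
# hysteresis thresholds used to eliminate streaking.
edges = canny(img, sigma=2.0)  # boolean edge map of thin, non-maxima-suppressed edges
```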

Corner Detector

We should be able to easily localise the point by looking through a small window: shifting the window in any direction should give a large change in intensity.

Harris Detector

Computes a corner response from the local second-moment matrix M of image gradients, R = det(M) - k (trace M)^2, and marks points where R is large as corners.

Line Detection

Edge Tracking Methods

- Given an approximate location of the boundary, the accurate location needs to be found:
  - Search for strong edges along the normal to the approximate boundary.
  - Fit a curve (e.g. polynomials) to the strong edges.
- Given points A and B between which the boundary lies, find the boundary:
  - Connect A and B with a line.
  - Find the strongest edge along the line bisector.
  - Use that edge point as a break point.
  - Repeat.

Line Grouping Problem

- Extraneous data: clutter or multiple models.
  - We do not know what is part of the model.
  - Can we pull out models with a few parts from much larger amounts of background clutter?
- Missing data: only some parts of the model are present.
- Noise.
- Cost: it is not feasible to check all combinations of features by fitting a model to each possible subset.

Hough Transform

- An elegant method for direct object recognition.
- Edges need not be connected.
- The complete object need not be visible.
- Key idea: edges VOTE for the possible model.
- The parameter space is also called the Hough Space.
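A sketch of the voting idea for lines in (rho, theta) form (the array sizes and discretisation are my own choices):

```python
import numpy as np

def hough_lines(edge_mask, n_theta=180):
    h, w = edge_mask.shape
    diag = int(np.ceil(np.hypot(h, w)))          # maximum possible |rho|
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((2 * diag + 1, n_theta), dtype=int)  # the Hough Space
    ys, xs = np.nonzero(edge_mask)
    for x, y in zip(xs, ys):
        # each edge pixel VOTES for every line rho = x*cos(theta) + y*sin(theta)
        rho = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rho + diag, np.arange(n_theta)] += 1
    return acc  # peaks in the accumulator correspond to lines
```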

Application Issues

- Difficulty: how big should the cells be? (Too big, and we merge quite different lines; too small, and noise causes lines to be missed.)
- How many lines?
  - Count the peaks in the Hough array.
  - Treat adjacent peaks as a single peak.
- Which points belong to each line?
  - Search for points close to the line.
  - Solve again for the line and iterate.

Motion Estimation

- Each image is divided into small blocks.
- The current image frame is referred to as the Target Frame.
- A match is sought between a block in the Target Frame and the most similar block in the previous and/or future frame(s), referred to as the Reference Frame(s).
- The displacement of the reference block relative to the target block is called a Motion Vector (MV).
- The MV search is usually limited to a small immediate neighbourhood, with horizontal and vertical displacements in the range [-p, +p].
- This makes a search window of size (2p + 1) x (2p + 1).

Reference Frame (left), Current Frame (middle), and Motion Vectors (right).

The difference between two blocks can be measured by their Mean Absolute Difference (MAD): for an N x N block at (x, y) and a candidate displacement (u, v),

MAD(u, v) = (1 / N^2) Σi Σj |C(x + i, y + j) - R(x + u + i, y + v + j)|

where C is the Target Frame and R is the Reference Frame.
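A sketch of exhaustive block matching that searches the (2p + 1) x (2p + 1) window for the displacement minimising the MAD (the block size n and range p below are illustrative defaults):

```python
import numpy as np

def motion_vector(target, reference, x, y, n=8, p=7):
    block = target[y:y + n, x:x + n]               # block in the Target Frame
    best, mv = np.inf, (0, 0)
    for dy in range(-p, p + 1):                    # vertical displacement
        for dx in range(-p, p + 1):                # horizontal displacement
            ry, rx = y + dy, x + dx
            if 0 <= ry and ry + n <= reference.shape[0] \
                    and 0 <= rx and rx + n <= reference.shape[1]:
                cand = reference[ry:ry + n, rx:rx + n]
                mad = np.mean(np.abs(block.astype(float) - cand))
                if mad < best:
                    best, mv = mad, (dx, dy)       # the Motion Vector
    return mv
```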

Neural Network

A perceptron computes the weighted sum (x1 * w1) + (x2 * w2) + θ and outputs +1 if the sum is greater than 0. For example, with x1 = 2, x2 = 1, w1 = 0.5, w2 = 0.3 and θ = -1, the sum is (2 * 0.5) + (1 * 0.3) - 1 = 0.3; as 0.3 > 0, the output Y will be +1.

Now suppose the desired output is 0 (Yd = 0) while the actual output is Y = 1. The weights are updated using w_i(p+1) = w_i + (Yd - Y) * x_i and θ(p+1) = θ + (Yd - Y):

- w1(p+1) = 0.5 + (0 - 1) * 2 = -1.5
- w2(p+1) = 0.3 + (0 - 1) * 1 = -0.7
- θ(p+1) = -1 + (0 - 1) = -2

In general, if Yd < Y the weights are decreased, and if Yd > Y the weights are increased (for positive inputs), as in the worked update above and the sketch below.
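The worked example, as a minimal sketch (step activation with the threshold θ added to the weighted sum, and the same update rule):

```python
def perceptron_step(x, w, theta, y_desired):
    # step activation: output 1 if the weighted sum plus theta exceeds 0
    y = 1 if sum(xi * wi for xi, wi in zip(x, w)) + theta > 0 else 0
    err = y_desired - y
    w = [wi + err * xi for xi, wi in zip(x, w)]  # w_i(p+1) = w_i + (Yd - Y) * x_i
    theta = theta + err                          # theta(p+1) = theta + (Yd - Y)
    return w, theta, y

w, theta, y = perceptron_step([2, 1], [0.5, 0.3], -1, y_desired=0)
print(y, w, theta)  # 1 [-1.5, -0.7] -2  (matching the values above)
```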

Multilayer Neural Networks

- A natural extension of the perceptron.
- Deals with data that is not linearly separable.
- Essentially involves linking numerous perceptrons together.
- All nodes from one layer are connected to the next one.
- Can be extended to any number of outputs and hidden layers.

XOR

XOR(0,0) = 0
XOR(0,1) = 1
XOR(1,0) = 1
XOR(1,1) = 0

Backpropagation

- A generalisation of the perceptron training procedure.
- Also iterative, with adjustments after each case is presented.
- There are two stages in the algorithm:
  - forward propagation of the input pattern from the input layer to the output layer;
  - back propagation of the error vector from the output layer to the input layer.
- There will be many optima (solutions), unlike the perceptron.
- A sigmoid function is used rather than a step function.
- The learning rate and the error derivative are used to scale the weight adjustment.

Backpropagation can be used in forecasting, tracking and deep learning (see the sketch below).
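A toy NumPy sketch of the two stages learning XOR (the layer sizes, learning rate and epoch count are my own choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))    # sigmoid, not step
lr = 0.5                                        # learning rate

for _ in range(20000):
    # Stage 1: forward propagation, input layer to output layer
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Stage 2: back propagation of the error vector, output to input layer
    d_out = (out - Y) * out * (1 - out)         # error times sigmoid derivative
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out);  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h);    b1 -= lr * d_h.sum(axis=0)

print(out.round().ravel())  # typically converges to [0, 1, 1, 0]
```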

Neural Network Training Algorithms

- Gradient Descent
- Newton's Method
- Conjugate Gradient

Expert Systems

Expert systems are software packages designed to assist humans in situations in which an expert in a specific area is required. Knowledge is a theoretical or practical understanding of a subject or a domain.

Major tasks include:

- Obtain the required knowledge from an expert.
- Express the knowledge as a collection of rules in the form of logical implications (the knowledge base).
- Extract conclusions (reasoning).

- Knowledge base: a set of rules describing knowledge of a specific domain. Rules can have multiple antecedents joined by the keywords AND, OR, or a combination of both.
- Inference engine: carries out the reasoning whereby the expert system reaches a solution. It links the rules given in the knowledge base with the facts provided in the database.
- Reasoning: a mechanism for selecting the relevant facts and extracting conclusions from them in a logical way. Example: given that the rule ‘If A Then B’ and the fact A are both in the knowledge base, the fact B can be concluded.

Expert systems are good at knowledge representation, uncertainty tolerance and explanation ability.

Rule-Based Expert System

- In a rule-based expert system, the domain knowledge is represented by a set of IF-THEN production rules, and data is represented by a set of facts about the current situation.
- The inference engine compares each rule stored in the knowledge base with the facts contained in the database.
- When the IF (condition) part of a rule matches a fact, the rule is fired and its THEN (action) part is executed.
- The matching of rule IF parts to facts produces inference chains.
- An inference chain indicates how the expert system applies the rules to reach a conclusion.

Forward Chaining

- Data-driven reasoning: reasoning starts from the known data and proceeds forward with that data.
- Each time, only the topmost matching rule is executed.
- When fired, a rule adds a new fact to the database.
- Any rule can be executed once only.
- The match-fire cycle stops when no further rules can be fired.
- It is a technique for gathering information and then inferring from it whatever can be inferred (see the sketch below).
- However, in forward chaining, many rules may be executed that have nothing to do with the established goal. Therefore, if the goal is to infer only one particular fact, the forward chaining inference technique would not be efficient.
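A minimal sketch of the match-fire cycle described above (the two-rule knowledge base is hypothetical, not from the notes):

```python
rules = [  # hypothetical knowledge base
    ({"A", "B"}, "C"),   # IF A AND B THEN C
    ({"C"}, "D"),        # IF C THEN D
]

def forward_chain(facts):
    fired = set()
    while True:
        for i, (conditions, action) in enumerate(rules):
            if i not in fired and conditions <= facts:
                facts.add(action)    # the fired rule adds a new fact
                fired.add(i)         # any rule can be executed once only
                break                # only the topmost matching rule fires
        else:
            return facts             # match-fire cycle stops: no rule fired

print(forward_chain({"A", "B"}))     # {'A', 'B', 'C', 'D'}
```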

Backward Chaining

- Goal-driven reasoning: the expert system has a goal, and the inference engine attempts to find the evidence to prove it.
- First, the knowledge base is searched for rules that might have the desired solution; such rules must have the goal in their THEN (action) parts.
- If such a rule is found and its IF (condition) part matches data in the database, then the rule is fired and the goal is proved.
- However, this is rarely the case. Thus the inference engine puts aside the rule it is working with (the rule is said to stack) and sets up a new goal, a subgoal, to prove the IF part of this rule.
- Then the knowledge base is searched again for rules that can prove the subgoal (see the sketch below).
- The inference engine repeats the process of stacking rules until no rules are found in the knowledge base to prove the current goal.
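A minimal sketch of goal-driven reasoning over the same hypothetical knowledge base (recursion stands in for the rule stack; no cycle detection, for brevity):

```python
rules = [  # same hypothetical knowledge base as above
    ({"A", "B"}, "C"),   # IF A AND B THEN C
    ({"C"}, "D"),        # IF C THEN D
]

def backward_chain(goal, facts):
    if goal in facts:
        return True
    for conditions, action in rules:
        # the rule must have the goal in its THEN (action) part;
        # its IF conditions become subgoals, proved recursively
        if action == goal and all(backward_chain(c, facts) for c in conditions):
            return True
    return False  # no rule in the knowledge base proves the current goal

print(backward_chain("D", {"A", "B"}))  # True
```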

Choosing Between Forward and Backward Chaining

- If an expert first needs to gather some information and then tries to infer from it whatever can be inferred, choose the forward chaining inference engine.
- However, if the expert begins with a hypothetical solution and then attempts to find facts to prove it, choose the backward chaining inference engine.

Conflict Resolution

When two rules have the same conditions but different actions, they form a conflict set. The inference engine must determine which rule to fire from such a set. A method of choosing a rule to fire when more than one rule can be fired in a given cycle is called conflict resolution.

In forward chaining, both rules would be fired. The topmost rule is fired first, so its THEN part is executed and the linguistic object ‘action’ obtains a value. The other rule is also fired, because its condition part matches the fact now in the database; as a result, the object ‘action’ takes on a new value.

Methods used for conflict resolution:

- Fire the rule with the highest priority. Priority can be established by placing the rules in an appropriate order in the knowledge base. This works well with around 100 rules.
- Fire the most specific rule. This assumes that a specific rule processes more information than a general one.
- Fire the rule that uses the data most recently entered in the database. This relies on the time tag attached to each fact in the database.

Rule-Based Expert Systems Pros & Cons

Pros

- Natural knowledge representation: problem-solving procedures can be represented naturally as IF-THEN production rules.
- Uniform structure: production rules have the uniform IF-THEN structure.
- Separation of knowledge from its processing: as the knowledge base is separate from the inference engine, different applications can be developed using the same expert system shell.
- Dealing with incomplete and uncertain knowledge: most rule-based expert systems can represent and reason with incomplete and uncertain knowledge.

Cons

- Opaque relations between rules.
- Ineffective search strategy: the inference engine applies an exhaustive search through all the production rules during each cycle, so expert systems with a large set of rules can be slow and therefore unsuitable for real-time applications.
- Inability to learn: expert systems do not learn from experience; they cannot automatically modify the knowledge base, adjust existing rules or add new ones.

Bayesian Networks

Bayesian network: a representation of a joint probability distribution over a set of random variables with possible mutual causal relationships. It contains:

- a causal structure of interconnected nodes (directed acyclic links)
- conditional distributions at each node

Markov Blanket: the set consisting of a node's parents, its children, and the parents of its children. It renders the node independent of all other nodes.

D-Separation: a criterion for deciding, from a given causal graph, whether a set X of variables is independent of another set Y, given a third set Z. The idea is to associate...

