COMP14112 Notes

Title: COMP14112 Notes
Author: Ian Winstanley
Course: Fundamentals of Artificial Intelligence
Institution: University of Manchester

Summary

A set of notes for the course Fundamentals of Artificial Intelligence at the University of Manchester.


Description

COMP14112

Brief History of AI

• AI can be thought of as programs which replicate human thought, meaning a program which would pass the Turing test
• It generally refers to computer systems which behave rationally
• This means a system which takes the best option to achieve a goal based on its knowledge and beliefs
• Rationality does not mean that the system always achieves the goal, but that it gives itself the maximum chance of success based on the information it has been provided
• In some circumstances it is very difficult to choose the best possible option, but a system can still achieve limited rationality
• Alan Turing was the first person to consider whether a computer could behave in a similar way to a human
• AI breakthroughs have commonly been applied to game playing, as there is a fixed set of possible outcomes, although it is sometimes not feasible to process all of them
• Other types of AI use deductive logic, which relies on propositional logic systems to perform tasks such as theorem proving
• However, logic does not readily allow for the uncertainty faced by most AI systems
• AI systems can use semantic networks to represent relations between known objects, which can assist with reasoning
• Most fully rational systems can only be solved in exponential time, so heuristics often have to be used to obtain an approximate answer
• Increased computing power was important in allowing AI systems to obtain sufficiently accurate answers to problems
• Artificial neural networks are used to solve AI problems by attempting to simulate the neural networks in a biological brain
• Statistical reasoning can be used by AI systems to determine the action most likely to be successful from uncertain information

Robot Localisation

• The robot localisation problem consists of a robot in a static, known environment with obstacles, where the robot is unaware of its location within that environment
• The robot has sensors which are noisy and so return an approximate measurement of the robot's distance from an obstacle or wall
• The sample space for a problem is the set of all possible outcomes
• An event is any subset of the sample space
• A probability distribution assigns a probability between 0 and 1 to every event in the sample space
• The probability of the entire sample space must be 1
• If two events are mutually exclusive, they cannot both happen at the same time and so occupy disjoint parts of the sample space
• If two events are mutually exclusive, the probability of event 1 and event 2 both happening is 0, and the probability of event 1 or event 2 happening is the sum of the probabilities of event 1 and event 2 (see the sketch after this list)
• The probability of an event E is written as p(E)
• This can often represent the degree of belief a system has in the event E based on the information it has received
• Therefore this probability can be subjective, depending on what information the system has been given
• The complement of an event is the event in which the original event does not happen
• The probability of the complement of an event is equal to 1 minus the probability of the original event
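As a concrete illustration of these rules, here is a minimal sketch in Python, assuming a fair six-sided die with equally likely outcomes (the die and the particular events are illustrative, not from the notes):

```python
from fractions import Fraction

# Sample space for a fair six-sided die (illustrative example).
sample_space = {1, 2, 3, 4, 5, 6}

def p(event):
    """Probability of an event (a subset of the sample space),
    assuming all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

even = {2, 4, 6}   # event E: the roll is even
low = {1, 2}       # event F: the roll is 1 or 2

# Complement rule: p(not E) = 1 - p(E)
assert p(sample_space - even) == 1 - p(even)

# Inclusion-exclusion: p(E or F) = p(E) + p(F) - p(E and F)
assert p(even | low) == p(even) + p(low) - p(even & low)

# Mutually exclusive events occupy disjoint subsets, so p(E and F) = 0
# and p(E or F) = p(E) + p(F)
odd = sample_space - even
assert p(even & odd) == 0 and p(even | odd) == p(even) + p(odd)
```

The same identities hold for any probability distribution; the uniform die just makes them easy to check by counting outcomes.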

• If an event E is a subset of an event F, then E happening means F must also happen
• This means the probability of E must be less than or equal to the probability of F
• For any two events E and F, the probability of E or F happening is equal to the probability of E plus the probability of F minus the probability of both E and F happening: p(E ∪ F) = p(E) + p(F) − p(E ∩ F)
• The empty set is a subset of the sample space and always has a probability of 0
• A set of events is mutually exclusive if, for any pair of the events, the probability of both happening is 0
• A set of events is jointly exhaustive if the probability of the disjunction (union) of all events in the set is equal to 1
• A set of events forms a partition of the sample space if the events are mutually exclusive and jointly exhaustive
• If a set of events forms a partition, exactly one of them will be true at any time
• If a set of events forms a partition, the sum of their probabilities will be equal to 1
• In the robot localisation problem, the robot has to base its belief about its current position on the information it has been told about its initial situation, the information it has obtained from its sensors, and the information it has about its own movements
• Its knowledge about the initial situation is often incomplete
• Sensor readings can be noisy or inaccurate, and the robot's actions may also be noisy
• The space the robot is known to be contained in can be divided into many smaller squares
• The robot's direction can also be divided into several non-overlapping intervals
• The robot's position combined with the direction it is facing can be thought of as its pose
• The events of the robot being in each possible pose form a partition
• The robot can represent its current belief about being in each possible pose as a probability distribution
• The robot initially has no knowledge about its position except that it is not on a square occupied by an obstacle
• Therefore, initially, each unoccupied square is assigned an equal probability, and each orientation within each square also has an equal probability
• The probabilities are normalised so that the probabilities of all poses sum to 1
• Therefore, the probability of being in each unoccupied pose is equal to 1 divided by the number of unoccupied poses
• The events of the robot being located in each square also form a partition, as do the events of the robot being in each orientation
• For two events in a probability distribution, the probability that an event E happens given that an event F happens is equal to the probability that both E and F happen divided by the probability that F happens: p(E | F) = p(E ∩ F) / p(F)
• It follows that the probability of E given F is equal to the probability of F given E multiplied by the probability of E, divided by the probability of F: p(E | F) = p(F | E) p(E) / p(F)
• This is known as Bayes' theorem
• If a set of events A1, ..., An forms a partition, the extended total probability formula says that, for any events E and F, the probability of E given F is the sum over the partition of the probability of E given both F and Ai, multiplied by the probability of Ai given F: p(E | F) = Σi p(E | Ai ∩ F) p(Ai | F)
• This can usefully be substituted into the denominator of Bayes' theorem (see the sketch after this list)
• A random variable X is a function which assigns a real value to each outcome in the sample space
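The following minimal sketch puts Bayes' theorem and the total probability formula together, assuming a small hypothetical partition of three hypotheses with made-up priors and likelihoods:

```python
# A sketch of Bayes' theorem with the denominator expanded by the total
# probability formula. The three hypotheses and all numbers below are
# illustrative assumptions, not taken from the course.

priors = {"H1": 0.5, "H2": 0.3, "H3": 0.2}       # p(H_i), a partition
likelihood = {"H1": 0.9, "H2": 0.5, "H3": 0.1}   # p(obs | H_i)

# Total probability: p(obs) = sum_i p(obs | H_i) * p(H_i)
p_obs = sum(likelihood[h] * priors[h] for h in priors)

# Bayes' theorem: p(H_i | obs) = p(obs | H_i) * p(H_i) / p(obs)
posterior = {h: likelihood[h] * priors[h] / p_obs for h in priors}

print(posterior)
assert abs(sum(posterior.values()) - 1.0) < 1e-12  # posteriors form a distribution
```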

• The value of X is probabilistic, meaning it is impossible to determine in advance what the exact value will be
• The cumulative distribution function gives the probability that the value of X will be less than or equal to the number provided
• If X can be any real number in an interval, the probability that it is equal to a specific number is always 0, as there are infinitely many real numbers in every interval
• A discrete random variable is one whose values are always contained within a countable set
• A continuous random variable is one whose values are contained within a portion of the real number line
• For a discrete random variable, the probability mass function gives the probability that the random variable is equal to a particular value
• The probability density function of a random variable is the derivative of its cumulative distribution function
• The value of a probability density function is always greater than or equal to 0 (unlike a probability, it is not bounded above by 1)
• The integral of a probability density function from minus infinity to infinity is always equal to 1
• The integral of a probability density function over an interval gives the probability that the random variable lies in that interval
• Many random variables follow a distribution called the normal distribution, in which values are more likely to be close to the mean
• The expected value of a random variable is the mean value it takes on average over many occurrences
• For a discrete random variable, this is calculated by summing, over every possible value, the product of the value and its probability: E[X] = Σx x p(X = x)
• For a continuous random variable, it is the integral from minus infinity to infinity of the value multiplied by the probability density function at that value: E[X] = ∫ x f(x) dx
• The average spread of the values of a random variable around the mean is known as the standard deviation and is computed via the variance
• The variance of a discrete random variable is the sum, over every possible value, of the squared difference between that value and the mean multiplied by the probability mass function at that value: Var(X) = Σx (x − μ)² p(X = x)
• The variance of a continuous random variable is the integral from minus infinity to infinity of the squared difference between the value and the mean multiplied by the probability density function at that value: Var(X) = ∫ (x − μ)² f(x) dx
• The standard deviation is calculated by taking the square root of the variance
• The sensor values in the robot localisation problem follow a normal distribution with mean equal to the true distance and a known standard deviation, but are never negative
• When the robot takes a sensor reading, it updates the probability of each pose to the probability that it is in that pose given the observation that was made
• Since the robot knows the environment, it can calculate the probability of an observation in a particular pose by working out the distance it would be from an obstacle in that direction and applying the normal distribution
• Bayes' theorem then allows the updated probability of each pose to be calculated: the numerator is the probability that the observation was made given the robot was in that pose, multiplied by the robot's current belief that it is in that pose
• The denominator is determined by summing, over every possible pose, the probability of the observation given that pose multiplied by the robot's belief that it is in that pose (see the sketch below)
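Below is a minimal sketch of this observation update, assuming a toy map with three poses, a hypothetical sensor reading, and an illustrative noise level (the notes only fix the general Bayesian scheme):

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

# Current belief: a probability for each pose (here just three poses).
belief = {"pose_a": 1 / 3, "pose_b": 1 / 3, "pose_c": 1 / 3}

# Distance the robot would measure from each pose in the known map.
true_distance = {"pose_a": 1.0, "pose_b": 2.5, "pose_c": 4.0}

sensor_sd = 0.5     # known standard deviation of the sensor noise
reading = 2.3       # the observation actually made

# Numerator of Bayes' theorem: p(obs | pose) * current belief in pose.
unnormalised = {pose: normal_pdf(reading, true_distance[pose], sensor_sd) * b
                for pose, b in belief.items()}

# Denominator: total probability of the observation over all poses.
p_obs = sum(unnormalised.values())

# Updated belief: the posterior probability of each pose.
belief = {pose: u / p_obs for pose, u in unnormalised.items()}
print(belief)   # pose_b now carries most of the probability mass
```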

• In this instance, the denominator of Bayes' equation gives the probability that the observation occurred at all, given the robot's current belief state
• The robot also needs to update its probability distribution when it performs an action
• If actions can be assumed to be deterministic, the updated probability of each pose is the sum of the probabilities, under the robot's current belief state, of every pose from which the specified action takes it to that pose
• If the action is noisy, so the distance the robot moves or turns can take multiple values, the updated probability is the sum, over every possible action which could take the robot to that pose, of the probability of that action happening multiplied by the belief in the pose it would start from (see the sketch after this list)
• The probability of each action happening follows a normal distribution with the intended value as the mean and a known standard deviation

Probabilistic Reasoning

• Uncertainty can be understood in terms of betting
• By the reasoning of the Dutch book, if purchasing a ticket offers the possibility of winning £1 or winning nothing, there will be a price between £0 and £1 which is considered fair
• A fair price is one where, if the ticket were purchased a large number of times, no money would be gained or lost on average
• The fair price is equal to the degree of belief in the winning event multiplied by the stake for winning
• The average payment is the price of the ticket multiplied by the number of times it is purchased
• The average gain is the probability of success multiplied by the number of times the ticket is purchased multiplied by the amount won
• For a fair price, the average payment will always be equal to the average gain
• A Dutch book is a series of bets which are each considered fair but which together always cause the agent to lose money
• A Dutch book can be constructed, for example, when an agent bets on two events which are not mutually exclusive as if they were
• If there is a condition for getting the money back, conditional probability can be used to determine the average payoff in each situation
• Agents can avoid a Dutch book by revising their beliefs in events using Bayesian updating when one of the events happens
• The Monty Hall problem gives you a choice of three doors: two doors have a goat behind them and one has a car behind it
• You choose a door, and then one of the other doors, which is known to have a goat behind it, is opened
• You then get the chance to stick with the door you chose or switch to the other unopened door, with the aim of choosing the one with the car behind it
• Solving this problem involves determining the conditional probability that each door is opened given that the car is behind each door
• Bayes' theorem can then be applied to find that there is a 2/3 probability of success when switching and a 1/3 probability of success when not switching (see the simulation sketch after this list)
• The belief state is updated on the opening of the door by judging how likely that door was to be opened in each circumstance

Speech Recognition

• Speech recognition is a common problem in artificial intelligence which has been solved in some restricted settings, such as when a small vocabulary is spoken
• However, error-free recognition of continuous, unrestricted speech remains an unsolved problem
• Current technologies are improving in accuracy, but there is still a lot to gain from improving such systems
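The noisy action update referred to earlier in this list can be sketched as follows (a minimal sketch, assuming a one-dimensional row of squares and an illustrative discretised noise model for a "move forward one square" action):

```python
# Action (motion) update for localisation on a one-dimensional strip.
# The strip length, the intended move, and the action noise probabilities
# are illustrative assumptions, not values from the notes.

n = 10                                  # squares 0..9
belief = [1.0 / n] * n                  # current belief over positions

# Noisy "move forward by 1": the robot actually moves 0, 1 or 2 squares
# with these probabilities (a discretised stand-in for the normal
# distribution around the intended value).
action_model = {0: 0.1, 1: 0.8, 2: 0.1}

new_belief = [0.0] * n
for start, p_start in enumerate(belief):
    for move, p_move in action_model.items():
        end = min(start + move, n - 1)  # assume the robot stops at the wall
        # Sum over every (start pose, action) pair that ends in this square.
        new_belief[end] += p_start * p_move

belief = new_belief
print(belief)   # the probability mass has shifted forward by about one square
```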
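And, as a sanity check on the Monty Hall analysis above, a quick simulation sketch (door labels and the trial count are arbitrary choices):

```python
import random

def play(switch, trials=100_000):
    """Estimate the win rate of sticking or switching in Monty Hall."""
    wins = 0
    for _ in range(trials):
        doors = [0, 1, 2]
        car = random.choice(doors)
        choice = random.choice(doors)
        # The host opens a door which is neither the chosen door nor the car.
        opened = random.choice([d for d in doors if d != choice and d != car])
        if switch:
            # Switch to the remaining unopened door.
            choice = next(d for d in doors if d != choice and d != opened)
        wins += (choice == car)
    return wins / trials

print("stick: ", play(switch=False))   # approaches 1/3
print("switch:", play(switch=True))    # approaches 2/3
```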

• Current solutions use a probabilistic approach similar to robot localisation
• This is because speech is a highly varied input which changes depending on the voice of the person speaking
• This means that any model must be sufficiently flexible
• Sound is initially represented as an air pressure wave, but the sound wave in its raw form is not very useful for speech recognition systems
• More useful features can be extracted from the raw sound wave using signal processing
• The sound wave is first split into several overlapping segments, and it is assumed that each segment is sufficiently uniform that it can be represented by a single vector of features (see the sketch after this list)
• A good set of features is then found for each segment
• A popular technique is the Fourier transform, which allows the dominant frequencies within a segment to be identified
• After some processing, a set of 13 Mel-frequency cepstrum coefficients (MFCCs) is generated for each segment
• Each MFCC represents the contribution of a particular frequency to the overall sound wave
• A series of 13-dimensional vectors describing the sound wave is then obtained, and these need to be processed to find the meaning of the particular word or sentence
• For short words, a segment can be labelled directly as a word but, for longer speech, it makes sense to break words into elementary parts of language called phonemes
• Each segment is classified as a particular phoneme based on its MFCCs
• There are a number of phonetic alphabets in use
• Sometimes it is useful to combine phonemes into sequences of three called triphones
• When a system is not able to definitively differentiate parts of speech, it can use probabilistic language models to predict what they will be
• This is simply a model which predicts which words (or phonemes) are most likely depending on what has come before
• Models have to be trained, which is where the parameters of the model are adjusted to fit some example data known as training data
• The model should then be tested on data independent of the training data to ensure a good model of speech has been built

Feature Based Classifiers

• The problem of speech recognition is to take a feature vector and classify it into one of multiple categories which represent words
• A simple way to do this is just to take the features of a pre-recorded word and fit the classifier to them
• This works for very simple speech models
• A classifier is a system which takes some input data and assigns it to one of a number of discrete classes
• It is useful for a classifier to represent its belief in a classification as the probability that a particular piece of data belongs to the class it has chosen
• This allows classifications which are more certain, and those which should be treated with more suspicion, to be identified
• The goal of a classifier is to make as few mistakes as possible, although it is impossible to make a perfect classifier when dealing with noisy data
• A simple classifier can use a single feature for each word by taking the average MFCC across the time segments
• This can be done because the word is assumed to be a single phoneme, so the segments should be uniform
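The framing and Fourier transform steps described above can be sketched as follows (a minimal sketch, assuming a synthetic 440 Hz tone and illustrative frame sizes; a real system would go on to compute MFCCs from these spectra rather than stopping at the dominant frequency):

```python
import numpy as np

sample_rate = 16_000                     # samples per second (assumed)
t = np.arange(sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 440.0 * t)     # one second of a 440 Hz tone

# Split the wave into overlapping segments: 25 ms frames every 10 ms.
frame_len, hop = 400, 160
for start in range(0, len(wave) - frame_len, hop):
    frame = wave[start:start + frame_len]
    spectrum = np.abs(np.fft.rfft(frame))             # magnitude spectrum
    freqs = np.fft.rfftfreq(frame_len, 1 / sample_rate)
    dominant = freqs[np.argmax(spectrum)]             # strongest frequency

# Each frame of the pure tone is dominated by the same 440 Hz component.
print(f"dominant frequency: {dominant:.0f} Hz")
```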

• Normally, the feature will follow a normal distribution for each class of data, with a different mean and standard deviation per class
• By finding this distribution, it is possible to determine the probability that a piece of data from each class has a particular value for the feature
• The Bayesian formula can then be used to calculate the probability that the data is in a particular class given that it has a particular value for the feature
• This is calculated by multiplying the probability that a data item from that class has that feature value by the prior belief that a word is in that class, and dividing by the probability that an observation with that feature value was made
• The prior belief is normally set as an equal probability for each class, so it will be equal to 1 divided by the number of classes
• The probability that a data item from a particular class has the specified feature value is obtained from the normal distribution
• The probability that an observation with that feature value was made is calculated by summing, across all classes being modelled, the product of the prior belief in each class and the probability that a word in that class has that feature value
• This calculation is then repeated for each class that the word could be in
• The sum of the probabilities across all classes will be equal to 1
• The word is classified as the class with the highest probability
• The normal distribution for the feature in each class is generated through training
• This is the process of taking words whose class has already been labelled and finding the mean and standard deviation of the feature across all the words belonging to each class
• This can be used to build the normal distribution for each individual class
• Classifiers can be improved by examining more than one feature
• For example, there are 13 MFCCs, so a better classifier would examine the values of all of these as separate features instead of averaging them into one feature
• The process is the same except that, instead of obtaining the probability that a word in a particular class has the given feature value straight from one normal distribution, a probability is obtained for each feature individually from its own normal distribution and these probabilities are multiplied together
• This requires a normal distribution to be built for each feature in each class
• This creates a classifier called a naive Bayes classifier, which classifies based on multiple features (see the sketch after this list)
• It is known as naive because it makes the assumption that all the features are conditionally independent given knowledge of the class
• In the case of speech recognition, this is generally correct

Markov Chains

• A Markov chain is also known as a first-order Markov process and is a model of how sequences can be generated in a random or probabilistic fashion
• A Markov model models sequences of events which occur in a state space; each sequence consists of a particular number of these states in order
• In an nth-order Markov process, the probability of a value in a sequence being a particular state depends on the states of the n previous values
• A zeroth-order Markov process is a sequence of random events which are independent of each other
• A Markov chain is a first-order Markov process, where the probability of an event being in a particular state depends only on the state of the previous event (see the sampling sketch below)
• A homogeneous Markov process is one where the transition probabilities do not change over time, i.e. the probability of a particular state occurring depends only on what the previous state was, not on the position in the sequence
• Since a Markov process is a sequence of states, it can be represented as a directed graph with nodes representing the states...
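Here is a minimal sketch of the naive Bayes classifier described above, assuming two hypothetical classes and two features with made-up training values (a real system would use the 13 MFCCs as features):

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

# Labelled training data: class -> list of feature vectors (illustrative).
training = {
    "yes": [[1.0, 4.2], [1.2, 3.9], [0.8, 4.0]],
    "no":  [[3.1, 1.0], [2.9, 1.3], [3.3, 0.9]],
}

# Training: fit a mean and standard deviation per feature per class.
params = {}
for cls, vectors in training.items():
    stats = []
    for feature in zip(*vectors):                    # one column per feature
        mean = sum(feature) / len(feature)
        sd = math.sqrt(sum((v - mean) ** 2 for v in feature) / len(feature))
        stats.append((mean, sd))
    params[cls] = stats

def classify(vector):
    prior = 1 / len(params)                          # equal prior per class
    scores = {}
    for cls, stats in params.items():
        # Naive assumption: multiply the per-feature probabilities together.
        likelihood = 1.0
        for x, (mean, sd) in zip(vector, stats):
            likelihood *= normal_pdf(x, mean, sd)
        scores[cls] = likelihood * prior
    total = sum(scores.values())                     # denominator: p(observation)
    return {cls: s / total for cls, s in scores.items()}

print(classify([1.1, 4.1]))                          # "yes" dominates
```

Note how the denominator normalises the scores so the class probabilities sum to 1, exactly as in the single-feature case.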
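Finally, the sampling sketch of a homogeneous first-order Markov chain referred to above, assuming an illustrative two-state weather model:

```python
import random

# Transition probabilities p(next state | current state). A homogeneous
# chain uses the same table at every step; the states and numbers here
# are illustrative assumptions.
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def sample_chain(start, length):
    """Generate a sequence where each state depends only on the previous one."""
    state, sequence = start, [start]
    for _ in range(length - 1):
        nxt = transitions[state]
        state = random.choices(list(nxt), weights=list(nxt.values()))[0]
        sequence.append(state)
    return sequence

print(sample_chain("sunny", 10))
```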

