Copy of 440 final 30-38
Author: Bayani Julian
Course: Artificial Intelligence
Institution: University of Illinois at Urbana-Champaign




GOOD LUCK & GOD SPEED Y'ALL /// RIP

Final Exam Skills List

Neural Nets, Markov Decision Processes, Reinforcement Learning, Game Search, Vector Semantics, Search, Constraint satisfaction problems, Planning, Configuration space, Probability, Naive Bayes, Testing, Modelling text data, Bayes nets, Natural Language, POS tagging, HMMs, Computer vision, Classifiers, Linear Classifiers

Last Semester’s Slides if you are too lazy to watch videos: https://courses.engr.illinois.edu/ece448/sp2018/lectures.html


Neural Nets
● Basic design (i.e. a connected set of linear classifiers)
  ○ Neuron
  ○ Neural Net
    ■ When people say "neural net," they usually mean a design with more than one layer of units, including "hidden" units which aren't directly connected to the output.
● What kinds of functions can a neural net approximate?
  ○ In theory, a single hidden layer is sufficient to approximate any continuous function, assuming you have enough hidden units.
  ○ A neural net with increasingly many layers can approximate increasingly complex decision boundaries:

    ■ With 0 hidden layers (a linear classifier):
      ● Hyperplanes
    ■ With 1 hidden layer:
      ● The boundary of a convex region (open or closed)
    ■ With 2 hidden layers:
      ● Combinations of convex regions

  ○ To approximate a complex shape with only 2-3 layers, each hidden unit takes on some limited task.
● Training
  ○ Top-level (aka simple) update equation for a single weight in the network:
    ■ w_i = w_i − α · ∂loss/∂w_i
  ○ What does backpropagation compute? (Not the details of how the computation is done.)
    ■ Basically, it allows us to compute ∂loss/∂w_i for every weight w_i in the network.
    ■ For a weight w at some level of the network, we can assemble the value of ∂loss/∂w using the equation for the derivative at our level plus the derivative values computed at the previous level; backpropagation fills in those derivative values.
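Below is a minimal numpy sketch of this update rule and of what backpropagation computes, on a tiny one-hidden-layer network with sigmoid units and squared-error loss. The network sizes, learning rate, and single training example are invented for illustration; this is not the course's code.

import numpy as np

# Tiny 1-hidden-layer net: 2 inputs -> 3 hidden units -> 1 output, sigmoid activations.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))     # random initialization (also breaks symmetry; see below)
W2 = rng.normal(size=(1, 3))
alpha = 0.5                      # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0])        # one training example
y = np.array([1.0])              # its target output

for step in range(1000):
    # Forward pass
    h = sigmoid(W1 @ x)          # hidden activations
    out = sigmoid(W2 @ h)        # network output
    loss = 0.5 * np.sum((out - y) ** 2)

    # Backward pass: the chain rule fills in d(loss)/d(weight) layer by layer,
    # reusing the derivative values computed at the layer above.
    d_out = (out - y) * out * (1 - out)     # derivative arriving at the output unit
    dW2 = np.outer(d_out, h)                # d(loss)/dW2
    d_h = (W2.T @ d_out) * h * (1 - h)      # derivative passed back to the hidden layer
    dW1 = np.outer(d_h, x)                  # d(loss)/dW1

    # Top-level update rule: w_i = w_i - alpha * d(loss)/d(w_i)
    W2 -= alpha * dW2
    W1 -= alpha * dW1

print(loss)                      # close to 0 once the net fits this one example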

  ○ Symmetry breaking
    ■ One new issue in training is symmetry breaking. That is, we need to ensure that hidden units in the same layer learn to do different things. Suppose that two units have the same inputs, and also send their output to the same place(s). Then our basic linear classifier training will give them the same weights. At that point, they will do the same thing, so we might as well just have one unit.
    ■ Two approaches to solve this problem:
      ● Set the initial weights to random values.
      ● Randomize some aspect of the later training, e.g. ignore some randomly-chosen units on each update.
  ○ Data augmentation
    ■ Training data is always very sparse. So make more training examples by perturbing existing ones in ways that shouldn't (ideally) change the network's output. For example, if you have one picture of a cat, make more by translating or rotating the cat.
  ○ Dropout
    ■ Within the network, each unit pays attention to training data only with probability p. On other training inputs, it stops listening and starts reading its email or something. This can help prevent overfitting.
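A quick sketch of the dropout idea on one layer's activations: each unit participates only with probability p on a given training input. The 1/p rescaling ("inverted dropout") is a common convention added here for completeness, not something stated in the notes.

import numpy as np

rng = np.random.default_rng(0)
p = 0.8                                   # probability a unit "pays attention"
h = np.array([0.3, 0.9, 0.1, 0.7])        # hidden activations for one training input

mask = rng.random(h.shape) < p            # which units participate this time
h_dropped = np.where(mask, h, 0.0) / p    # zero out the rest; rescale by 1/p (inverted dropout)
print(h_dropped)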

● Convolutional neural networks
  ○ How does a convolutional layer work (e.g. connection pattern)?
    ■ [Figure: a convolutional layer (blue) attached to an input volume (red).]
    ■ The input volume is a 32x32 image with a depth of 3 because of the RGB values.
    ■ Because an image is so big, each neuron in the convolutional layer is connected only to a local region of the input volume spatially (a 3x3 area, maybe), but to the full depth (i.e. all color channels).
    ■ Inside the convolutional layer there are multiple neurons (5 in this example) along the depth, all looking at the same region of the input.
    ■ These neurons compute a weighted sum of the values in that local region; the set of weights is called a mask or filter. Masks can be used to apply different operations, such as:
      ● Edge detection
      ● Sharpen
      ● Gaussian blur
    ■ One might imagine that a number of different convolution masks would be useful to apply, each picking out a different type of feature. So, in reality, each network layer has a significant thickness; this is why the convolutional layer above has a depth of 5, each slice detecting a different type of feature.
    ■ An example of how one layer of processing might work:
      ● [Figure: one convolutional layer processing a 7x7x3 input with two filters.]
      ● Here the input volume is a 7x7x3 image.
      ● There are two classifier units, Filter 1 and Filter 2, with their weights shown in red, so this layer has a depth of 2.
      ● In this example, each classifier unit produces values only at every third input location, so the output is a 3x3x2 image because there are two filters.
    ■ Used mostly outside sources to understand these:
      ● http://cs231n.github.io/convolutional-networks/
      ● http://cs231n.github.io/assets/conv-demo/index.html
      ● https://medium.com/@RaghavPrabhu/understanding-of-convolutional-neural-network-cnn-deep-learning-99760835f148
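A small numpy sketch of the operation described above: slide a 3x3 mask over a single-channel image and take a weighted sum at each position. The edge-detection mask values are a standard textbook example, and the stride-3 call mirrors the "values only at every third input location" idea (no padding and a single channel here, so the shapes differ from the 7x7x3 example).

import numpy as np

def convolve2d(image, mask, stride=1):
    # Slide the mask over the image; each output value is one neuron's weighted sum.
    kh, kw = mask.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * mask)
    return out

image = np.random.default_rng(0).random((7, 7))      # toy 7x7 single-channel "image"

edge_mask = np.array([[-1, -1, -1],                  # a classic edge-detection mask
                      [-1,  8, -1],
                      [-1, -1, -1]])

print(convolve2d(image, edge_mask).shape)            # (5, 5) with stride 1
print(convolve2d(image, edge_mask, stride=3).shape)  # (2, 2): an output only every third location

A real convolutional layer applies several such masks, each summed over the full input depth, producing one output slice per filter.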

  ○ What is a pooling layer?
    ■ It's a type of neural net layer that reduces the size of the data by producing an output value only for every kth input value in each dimension.
    ■ The output values may be selected input values, the average over a group of inputs, or the maximum over a group of inputs.
    ■ This kind of reduction in size ("downsampling") is especially sensible when data values are changing only slowly across the image.
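A sketch of 2x2 max pooling on a single-channel input (sizes invented for illustration); average pooling would simply replace .max() with .mean().

import numpy as np

def max_pool(image, k=2):
    # Downsample by keeping only the maximum of each k-by-k block.
    h, w = image.shape[0] // k, image.shape[1] // k
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = image[i*k:(i+1)*k, j*k:(j+1)*k].max()
    return out

x = np.arange(16).reshape(4, 4)
print(max_pool(x))
# [[ 5.  7.]
#  [13. 15.]]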

  ○ Overall architecture, e.g. what kinds of features are detected in early vs. late layers?
    ■ A complete CNN typically contains three types of layers:
      ● convolutional (large input/output, small masks)
      ● pooling
      ● fully-connected (usually towards the end of the computation)
    ■ [Figure: example CNN pipeline for image classification.]
    ■ In image classification, a ConvNet may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes, in higher layers. The fully connected components at the end calculate the probabilities and classify the most probable class, as seen in the picture above.
  ○ What is a CNN typically used for? Why is it a better choice than a fully-connected network?
    ■ Convolutional neural nets are a specialized architecture designed to work well on image data (also apparently used somewhat for speech data).
    ■ The large size of each layer makes it infeasible to connect units to every unit in the previous layer. Full interconnection can be done for artificially small (e.g. 32x32) input images. For larger images, this would create too many weights to train effectively with the available training data.
● What (briefly) is a...
  ○ Recurrent neural network: a neural net with connections that loop back from a layer to the same or a previous layer.
  ○ Generative adversarial neural network: consists of 2 neural nets that learn a model of the input data.
    ■ The classifier tries to distinguish real training images from similar fake images.
    ■ The adversary tries to produce convincing fake images.
● Adversarial examples
  ○ Current training procedures for neural nets still leave them excessively sensitive to small changes in the input data. So it is possible to cook up patterns that are fairly close to random noise but push the network's values towards or away from a particular output classification. Adding these patterns to an input image creates an "adversarial example" that looks almost identical to the original to a human but gets a radically different classification from the network.




Markov Decision Processes
● Model and terminology for an MDP
  ○ Basic info:
    ■ The basic set-up is an agent (e.g. imagine a robot) moving around a world like the one shown below. Positions add or deduct points from your score. The agent's goal is to accumulate the most points.
    ■ This could be a finite game, but it's typically best to imagine a game that goes on forever, so ending or goal states reset the game back to the start.
    ■ Actions are executed with errors. So if our agent commands a forward motion, there's some chance that it will instead move sideways.
  ○ State: the status of the agent.
  ○ Actions: the transition options an agent has to take it to a new state.
  ○ Reward function R(s): returns the reward for a given state.
    ■ The reward function could have any pattern of negative and positive values. However, the intended pattern is:
      ● A few states have big rewards or negative consequences.
      ● The rest of the states have some small background reward, often constant across all of these background states.
  ○ Transition function/probabilities P(s'|s,a): gives the probability that you go from one state to another.
  ○ Policy function π(s): returns the action to take for a given state.
    ■ A good policy maximizes reward over time:
      ● So something like the sum of P(sequence)·U(sequence) over all sequences of states.
      ● U(sequence) is the utility of the sequence of states, i.e. the sum of the rewards.
      ● P(sequence) is how often this sequence of states happens.
● Bellman equation
  ○ A good way to think about the best policy is the "Bellman equation," which relates the utility at each state s to the utilities at adjacent states:

  ○ U(s) = R(s) + γ max_{a∈A} Σ_{s'∈S} P(s'|s,a) U(s')
    ■ U(s): utility of a state
    ■ R(s): reward of a state
    ■ γ: reward decay
    ■ s': next state
  ○ Using the reward decay γ makes the rewards for the neighboring states count for less than the immediate reward R(s). It also causes the system of equations to converge.
  ○ If we fix the policy, the Bellman equation is simpler because we know exactly what action we'll command:
    ■ U(s) = R(s) + γ Σ_{s'∈S} P(s'|s,π(s)) U(s')



● Methods of solving the Bellman equation
  ○ Value iteration
    ■ The simplest solution method is value iteration. This method repeatedly applies the Bellman equation to update the utility values for each state.
    ■ Suppose U_k(s) is the utility value for state s in iteration k. Then:
      ● Initialize U_1(s) = 0 for all states s.
      ● Loop for k = 1 until the values converge:
        ○ U_{k+1}(s) = R(s) + γ max_{a∈A} Σ_{s'∈S} P(s'|s,a) U_k(s')
    ■ From these utility values, we can read off a final policy using the equation:
      ● π(s) = argmax_a Σ_{s'} P(s'|s,a) U(s')
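A compact value-iteration sketch on a made-up two-state MDP; the states, actions, rewards, and transition probabilities below are invented purely to make the loop runnable.

# Toy MDP: states A and B, actions "stay" and "go" (all numbers invented).
states = ["A", "B"]
actions = ["stay", "go"]
R = {"A": 0.0, "B": 1.0}                       # reward of each state
P = {                                          # P[(s, a)][s2] = P(s2 | s, a)
    ("A", "stay"): {"A": 0.9, "B": 0.1},
    ("A", "go"):   {"A": 0.2, "B": 0.8},
    ("B", "stay"): {"A": 0.1, "B": 0.9},
    ("B", "go"):   {"A": 0.8, "B": 0.2},
}
gamma = 0.9                                    # reward decay

U = {s: 0.0 for s in states}                   # initialize U_1(s) = 0
for k in range(1000):                          # loop until the values converge
    U_new = {s: R[s] + gamma * max(sum(P[(s, a)][s2] * U[s2] for s2 in states)
                                   for a in actions)
             for s in states}
    diff = max(abs(U_new[s] - U[s]) for s in states)
    U = U_new
    if diff < 1e-9:
        break

# Read off the final policy: pi(s) = argmax_a sum_{s'} P(s'|s,a) U(s')
pi = {s: max(actions, key=lambda a: sum(P[(s, a)][s2] * U[s2] for s2 in states))
      for s in states}
print(U, pi)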

  ○ Policy iteration
    ■ Policy iteration produces the same solution, but faster. It operates as follows:
      ● Start with an initial guess for the policy π.
      ● Alternate two steps:
        ○ Policy evaluation: use policy π to estimate utility values U.
          ■ U(s) = R(s) + γ Σ_{s'∈S} P(s'|s,π(s)) U(s')
          ■ Do a couple of iterations of value estimation using the formula above. We don't need an exact (fully converged) solution, because we'll be repeating this calculation each time we refine our policy π.
        ○ Policy improvement: use the utility values U to calculate a new policy π, as in value iteration:
          ■ π(s) = argmax_a Σ_{s'} P(s'|s,a) U(s')
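A matching policy-iteration sketch on the same kind of invented toy MDP, alternating a few evaluation sweeps under the current policy with a greedy improvement step.

states = ["A", "B"]
actions = ["stay", "go"]
R = {"A": 0.0, "B": 1.0}
P = {("A", "stay"): {"A": 0.9, "B": 0.1}, ("A", "go"): {"A": 0.2, "B": 0.8},
     ("B", "stay"): {"A": 0.1, "B": 0.9}, ("B", "go"): {"A": 0.8, "B": 0.2}}
gamma = 0.9

pi = {s: "stay" for s in states}               # initial guess for the policy
U = {s: 0.0 for s in states}

for _ in range(50):
    # Policy evaluation: a few Bellman sweeps under the fixed policy pi (no max needed).
    for _ in range(5):
        U = {s: R[s] + gamma * sum(P[(s, pi[s])][s2] * U[s2] for s2 in states)
             for s in states}
    # Policy improvement: make pi greedy with respect to the current U.
    new_pi = {s: max(actions, key=lambda a: sum(P[(s, a)][s2] * U[s2] for s2 in states))
              for s in states}
    if new_pi == pi:                           # policy stopped changing, so we're done
        break
    pi = new_pi

print(pi, U)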

  ○ Asynchronous dynamic programming
    ■ One useful tweak for solving a Markov Decision Process is "asynchronous dynamic programming." In each iteration, it's not necessary to update all states. We can select only certain states for updating, e.g.
      ● states frequently seen in some application (e.g. a game), or
      ● states for which the Bellman equation has a large error (i.e. compare the values of the left and right sides of the equation).
    ■ The details can be spelled out in a wide variety of ways.

Reinforcement Learning
● Basic reinforcement learning info
  ○ Pretty much the same as the MDP setting seen above, but our learning agent doesn't know P or R. So the agent has to start taking actions with incomplete (initially almost no) information, hoping to gradually learn how to perform well.
  ○ We typically imagine that it does a very long sequence of actions, returning to the same state multiple times. There is no permanent end state, so hazards or losses send it back to some initial state.
  ○ Learning loop:
    ■ Take an action.
    ■ Observe the outcome (state and reward).
    ■ Update the internal representation.
● Model-based reinforcement learning
  ○ A model-based learner operates much like a naive Bayes learning algorithm, except that it has to start making decisions as it trains. Initially it would use some default method for picking actions. As it moves around taking actions, it tracks counts for what rewards it got in different states and what state transitions resulted from its commanded actions. Periodically it uses these counts to:
    ■ estimate P(s'|s,a) and R(s), and then use these values to
    ■ estimate U(s) and π(s) as above:
      ● U(s) = R(s) + γ max_{a∈A} Σ_{s'∈S} P(s'|s,a) U(s')
      ● π(s) = argmax_a Σ_{s'} P(s'|s,a) U(s')
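A sketch of the counting step such a learner might use. The experience log below is fabricated for illustration (in practice these tuples come from the agent's own interactions), and the resulting estimates would be plugged into value/policy iteration as above.

from collections import defaultdict

# Fabricated experience: (state, action, reward, next_state) tuples the agent observed.
experience = [
    ("A", "go", 0.0, "B"),
    ("A", "go", 0.0, "B"),
    ("A", "go", 0.0, "A"),
    ("B", "stay", 1.0, "B"),
    ("B", "stay", 1.0, "A"),
]

trans_counts = defaultdict(lambda: defaultdict(int))   # counts of (s, a) -> s'
reward_sum = defaultdict(float)
reward_n = defaultdict(int)

for s, a, r, s2 in experience:
    trans_counts[(s, a)][s2] += 1
    reward_sum[s] += r
    reward_n[s] += 1

# Estimated model from the counts.
P_hat = {sa: {s2: c / sum(nxt.values()) for s2, c in nxt.items()}
         for sa, nxt in trans_counts.items()}
R_hat = {s: reward_sum[s] / reward_n[s] for s in reward_n}

print(P_hat)   # e.g. P(B | A, go) is estimated as 2/3
print(R_hat)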

● Model-free reinforcement learning
  ○ Q-learning version of the Bellman equation (expressing Q in terms of itself, without reference to the utility or transition probability functions)
    ■ Background: for Q-learning, we split up and recombine the Bellman equation and remove the references to the utility and transition probability functions. We can then estimate Q(s,a) by averaging the following quantity over an extended sequence of timesteps t. We need these estimates for all pairs of state and action. In a realistic training sequence, we'll move from state to state but keep returning periodically to each state, so this averaging process should still work as long as we do enough exploration.
    ■ The equation:
      ● Q_t(s,a) = R(s) + γ max_{a'} Q(s',a')

  ○ TD update algorithm
    ■ Q(s,a) = 0 for all s and all a
    ■ For t = 1 to forever:
      ● 1. In the current state s, select an action a.
      ● 2. Observe the reward R(s) and the next state s'.
      ● 3. Update using: Q(s,a) = Q(s,a) + α[R(s) − Q(s,a) + γ max_{a'} Q(s',a')]
    ■ For step 1 above, sometimes explore, otherwise do the best move.
      ● Best move: π(s) = argmax_a Q(s,a)
      ● Explore: otherwise choose a random or semi-random exploration action.
  ○ SARSA update algorithm
    ■ Pick an initial state s and an initial action a.
    ■ For t = 1 to forever:
      ● 1. Observe the reward R(s) and the next state s'.
      ● 2. From the next state s', select the next action a'.
      ● 3. Update using: Q(s,a) = Q(s,a) + α[R(s) − Q(s,a) + γ Q(s',a')]
      ● 4. (s,a) = (s',a')
    ■ For step 2 above, sometimes explore, otherwise do the best move.
      ● Best move: π(s) = argmax_a Q(s,a)
      ● Explore: otherwise choose a random or semi-random exploration action.
  ○ How do TD and SARSA differ?
    ■ The SARSA ("State-Action-Reward-State-Action") algorithm adjusts the TD algorithm to align the update with the actual choice of action.
    ■ The agent is prone to occasional random motions, because it likes to explore rather than just following its optimal policy. The TD algorithm assumes it will do a better job of following the policy, so it sends the agent along the edge of the hazard and it regularly falls in. The SARSA agent stays further from the hazard, so that the occasional random motion isn't likely to be damaging.
● The role of exploration in action selection (aka exploration vs. exploitation)
  ○ The obvious implementation of a model-based learner is risk-averse. Once a clear policy starts to emerge, it has a strong incentive to stick to its familiar strategy rather than exploring the rest of its environment. So it could miss very good possibilities (e.g. large rewards) if it didn't happen to see them early.
  ○ The probability of exploring would typically be set high at the start of learning and gradually decreased, allowing the agent to settle into a specific final policy.
● Online, offline, experience replay
  ○ Online: interacting directly with the world.
  ○ Offline: interacting with a simulation of some sort.
  ○ Experience replay: an online learner, but instead of forgetting its old experiences, it will:
    ■ remember old training sequences, and
    ■ spend some training time replaying these experiences rather than taking new actions in the world.

You've done an MP using Q-learning with TD update, so you should have a detailed understanding of how it works.
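For reference, here is a minimal tabular Q-learning sketch with the TD update and ε-greedy exploration. The little chain environment and its step function are invented for illustration; only the update in step 3 follows the equations above. Changing step 3 to use the Q value of the action actually chosen next, instead of the max, would turn this into SARSA.

import random
from collections import defaultdict

# Invented environment: states 0..4 in a chain; reaching state 4 gives reward 1
# and resets the agent to state 0. Actions are -1 (left) and +1 (right), executed
# with a little noise, like the MDP action errors described earlier.
def step(state, action):
    if random.random() < 0.1:
        action = -action
    next_state = min(4, max(0, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    if next_state == 4:                        # goal resets the game
        next_state = 0
    return reward, next_state

actions = [-1, +1]
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = defaultdict(float)                         # Q(s, a) = 0 for all s and a

state = 0
for t in range(20000):
    # Step 1: sometimes explore, otherwise take the best move argmax_a Q(s, a).
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q[(state, a)])
    # Step 2: observe the reward and the next state.
    reward, next_state = step(state, action)
    # Step 3: TD update.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward - Q[(state, action)] + gamma * best_next)
    state = next_state

print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(4)})   # learned policy: go right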

Game Search
● Game tree
  ○ Game tree for a game of tic-tac-toe showing all possible moves that can be made:
    ■ [Figure: partial tic-tac-toe game tree.]
● Zero-sum games
  ○ Final rewards for the two (or more) players sum to a constant. That is, a gain for one player involves equivalent losses for the other player(s). This would be true for games where there is a fixed number of prizes to accumulate, but not for games such as Scrabble, where the total score rises with the skill of both players.
● Basic method
  ○ Minimax strategy:
    ■ Build out the game tree to depth d.
    ■ At depth d, use a heuristic function to evaluate game positions.
    ■ Use minimax to propagate values from depth d up to the root of the tree.
      ● On levels where we move, take the maximum of the child values.
      ● On levels where the opponent moves, take the minimum of the child values.
    ■ Pick the best move available to us at the root node.
    ■ Example:

      ● [Figure: example minimax tree; circles are the maximizing player's moves, squares are the minimizing player's moves, and the chosen move at the root is marked with a blue arrow.]
      ● The algorithm generates the tree on the right, where the circles represent the moves of the player running the algorithm (the maximizing player), and the squares represent the moves of the opponent (the minimizing player).
      ● The algorithm evaluates each leaf node using a heuristic evaluation function, obtaining the values shown. The moves where the maximizing player wins are assigned positive infinity, while the moves that lead to a win for the minimizing player are assigned negative infinity.
      ● At level 3, the algorithm will choose, for each node, the smallest of the child node values and assign it to that same node. The next step, at level 2, consists of choosing for each node the largest of the child node values. The algorithm continues evaluating the maximum and minimum values of the child nodes alternately until it reaches the root node, where it chooses the move with the largest value (represented in the figure with a blue arrow). This is the move that the player should make in order to minimize the maximum possible loss.
      ● This procedure is called "minimax" and is optimal against an optimal opponent. It is usually our best choice even against suboptimal opponents, unless we have special insight into the ways in which they are likely to play badly.
  ○ Minimax search (using depth-first search):
    ■ This video explains it nicely:
      ● https://www.youtube.com/watch?v=l-hh51ncgDI
      ● At around 6:00 minutes is where the pruning is explained.
    ■ Algorithm:
      ● Do recursive depth-first search, with depth limit d.
      ● When we hit a node at depth d, use the heuristic function to evaluate the game position.
      ● As we recurse upwards, use minimax to propagate values upwards.
      ● Pick the best move available to us at the root node.
    ■ Pseudocode:
      ● Suppose that move(node, action) is the state/node that res...
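Since the notes' pseudocode is cut off above, here is a minimal depth-first minimax sketch on an explicit toy game tree (the leaf values are invented). It stands in for the move(node, action) formulation rather than reproducing it: internal nodes are lists of children, leaves are heuristic values.

# A tiny hand-made game tree: each inner list is the set of children reachable
# by one move; the numbers are heuristic values at the depth limit (invented).
game_tree = [
    [3, 12, 8],     # children of our first possible move
    [2, 4, 6],      # children of our second possible move
    [14, 5, 2],     # children of our third possible move
]

def minimax_value(node, maximizing):
    # Leaves return their heuristic value; otherwise take the max of the child
    # values on our levels and the min on the opponent's levels.
    if not isinstance(node, list):
        return node
    values = [minimax_value(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# At the root it's our move (maximize); one level down the opponent moves (minimize).
values = [minimax_value(child, maximizing=False) for child in game_tree]
best = max(range(len(game_tree)), key=lambda i: values[i])
print(values, "-> pick move", best)   # values are [3, 2, 2], so we pick move 0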

