CS221 Notes

Course: Artificial Intelligence: Principles and Techniques
Institution: Stanford University

Summary

Notes I took as a student in Percy Liang's class at Stanford University, CS221: Artificial Intelligence: Principles and Techniques.


Description

Lecture 1 9/24/18:
- AI growing in industry, government, academia, etc.
- Intelligent agents have perception, robotics, language, knowledge, reasoning, learning
- Lots of work ongoing on lots of open questions with great potential to impact society
- Paradigms of the class:
  - Modeling: figuring out what to keep and what not to keep when modeling can be an art form
  - Inference: will need to infer things from results
  - Learning: take data and perform learning on it to get out a model with parameters
- Reflex-based models are most common: feed-forward, no backtracking
  - Linear classifiers, deep neural networks, etc.
- State-based models are slightly more complex, still pretty common
  - Playing games, robot motion planning, natural language generation
- Variable-based models are more complex still
  - Constraint satisfaction problems, Bayesian networks
- Logic is most complex, highest intelligence, can reason deeply with digested info
- Discrete optimization can be helped by using dynamic programming
- Continuous optimization can be helped by using gradient descent
- Dynamic programming example:
  - Given 2 strings, s and t, output the minimum number of character insertions/deletions/substitutions to change s into t
  - One method is to start at the end and try making changes until we find matches
  - Note that insertion into s is the same as deletion from t
  - s = "a cat!", t = "the cats!"
  - Can create a search tree by trying out all 3 possible actions (insert/delete/substitute) when we do not have an exact match, and end up choosing the path with the lowest cost
  - See notebook for coded example (a sketch follows below)
- Gradient descent example:
  - Input is a set of coordinate pairs; output is a value w that minimizes the squared error/cost function F(w) = \sum_{i=1}^{n} (x_i w - y_i)^2
  - Overall is just taking small steps until you reach an optimal solution
  - Coded in notebook (see the gradient descent sketch under Lecture 2)
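The notebook code isn't reproduced here; below is a minimal sketch of the edit-distance computation described above, using memoized recursion over the ends of the two strings. The function name and cache are illustrative choices.

```python
def edit_distance(s, t, cache=None):
    """Minimum number of insertions/deletions/substitutions to turn s into t."""
    if cache is None:
        cache = {}
    if (s, t) in cache:
        return cache[(s, t)]
    if len(s) == 0:
        result = len(t)                                  # insert the rest of t
    elif len(t) == 0:
        result = len(s)                                  # delete the rest of s
    elif s[-1] == t[-1]:
        result = edit_distance(s[:-1], t[:-1], cache)    # last chars match, no cost
    else:
        sub = 1 + edit_distance(s[:-1], t[:-1], cache)   # substitute
        delete = 1 + edit_distance(s[:-1], t, cache)     # delete from s
        insert = 1 + edit_distance(s, t[:-1], cache)     # insert into s (= delete from t)
        result = min(sub, delete, insert)
    cache[(s, t)] = result
    return result

print(edit_distance("a cat!", "the cats!"))  # 4
```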

Lecture 2 9/26/18:
- Focus today on machine learning: how do we learn models?
- Especially focusing on reflex-based models (esp. linear predictors)
- Example: spam classification - given text, is the message spam?
- Types of prediction tasks:
  - Binary classification (yes/no, 0/1, +1/-1)
  - Regression (continuous data, predict a value)
  - Multiclass classification (y is a category)
  - Ranking (y is a permutation of inputs)
  - Structured prediction (translation, y built from parts)
- Starting point of any ML attempt is the data
- Supervised learning needs data in input/output pairs to train the model
- Notation:
  - Data: fancy D_train
  - Predictor: f
- Feature extraction:
  - Example: predicting y, whether or not a string x is an email address
  - What properties of x are relevant?
  - A feature extractor might come up with things like length, @ symbol, ending with .com, fraction of letters, etc.
- Feature vector \phi(x) takes an input x and maps it into a list of numbers, returning a vector with components representing different features of x
- Weight vector w has a weight w_j for each feature j representing the contribution of that feature to the overall prediction
- Score is simply the dot product of the feature vector and the weight vector
- For binary classification, our predictor is the sign of the score: if the score is positive return +1, if negative return -1, if zero then we're not sure
  - If we visualize it with a graph and draw the weight vector, a hyperplane orthogonal to the weight vector divides the +1/-1 spaces
  - That line/plane/hyperplane is called the decision boundary; it divides points of different labels from one another
- A learner takes in training data and goes from an optimization problem to an optimization algorithm to produce a predictor
- Loss(x, y, w) is what we want to minimize (difference between the prediction and the true label given input x)
- Binary classification:
  - Score (w • \phi(x)) is how confident we are in predicting +1
  - Margin on example (x, y) is (w • \phi(x))y and represents how correct we are
  - 0-1 loss for binary classifiers is Loss_{0-1}(x, y, w) = 1[f_w(x) ≠ y] = 1[(w • \phi(x))y ≤ 0]
    - Indicator function: 1 if the prediction is wrong, 0 if it is correct
- Linear regression:
  - Residual: amount by which the prediction overshoots the true value y, (w • \phi(x)) - y
  - Squared loss: Loss_squared(x, y, w) = (f_w(x) - y)^2
  - Absolute deviation loss: Loss_absdev(x, y, w) = |w • \phi(x) - y|
- The key in training models is to minimize the loss on training data
  - TrainLoss(w) = average loss over all (x, y) points in the training data
  - Need to set w to make global trade-offs that minimize TrainLoss overall
  - Least squares regression (L2): the minimizer is the mean of y, tries to accommodate everything, would be greatly swayed by outliers
  - Least absolute deviation regression (L1): the minimizer is the median of y, more robust to outliers
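As a small illustration (not from the course notebook), here is how the score, margin, and the losses above look in code; the numpy arrays standing in for w and \phi(x) are made-up values.

```python
import numpy as np

def score(w, phi_x):
    return np.dot(w, phi_x)                  # w · φ(x)

def margin(w, phi_x, y):
    return score(w, phi_x) * y               # (w · φ(x)) y, how correct we are

def loss_0_1(w, phi_x, y):
    return float(margin(w, phi_x, y) <= 0)   # 1 if the prediction is wrong

def loss_squared(w, phi_x, y):
    return (score(w, phi_x) - y) ** 2        # (f_w(x) - y)^2 for regression

def loss_absdev(w, phi_x, y):
    return abs(score(w, phi_x) - y)          # |w · φ(x) - y|

w = np.array([2.0, -1.0])
phi_x = np.array([1.0, 3.0])
print(score(w, phi_x), margin(w, phi_x, -1), loss_0_1(w, phi_x, -1))
# -1.0 1.0 0.0  (predicts -1, which matches y = -1, so the 0-1 loss is 0)
```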

- Now, to optimize TrainLoss, we can use gradient descent
  - Gradient gives the direction of steepest change (the direction that increases the loss the most)
  - Gradient descent: start w at all 0's, then repeat w ← w - \eta \nabla_w TrainLoss(w), where \eta is the step size
    - Move in the opposite direction from the gradient to go in the direction that decreases the loss the most
  - Coded gradient descent example in notebook (a sketch follows below)
    - A good way to test is to generate a synthetic dataset where you know what the answer should be (make up weights, and create points based on those weights)
    - Batch gradient descent is slow for large datasets, though, because we sum over all 100,000 data points for every step
    - In comes stochastic gradient descent, which takes a step for each training example
  - What should the step size \eta be?
    - If it's close to 0, we're being very cautious and taking small steps
    - If it's too big, we'll take large steps and may be overly aggressive and overshoot
    - One solution is to start a little bigger and then decrease it over time
  - What about for zero-one loss?
    - The gradient doesn't really work for it (it is zero almost everywhere)
    - Surrogate losses:
      - Hinge loss: Loss_hinge(x, y, w) = max{1 - (w • \phi(x))y, 0}
        - Hinge loss upper bounds the 0-1 loss and has a non-trivial gradient
        - Gradient has 2 cases: 0 (if 1 - (w • \phi(x))y < 0), and -\phi(x)y (otherwise)
      - Logistic regression: the loss is a smooth log-based curve, Loss_logistic(x, y, w) = log(1 + e^{-(w • \phi(x))y}); see slides
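A sketch in the spirit of the notebook example mentioned above: least-squares regression on a synthetic dataset generated from known weights, optimized with batch gradient descent and then stochastic gradient descent. The dataset size, step size, and iteration counts are made-up choices, not the course's.

```python
import numpy as np

np.random.seed(0)
true_w = np.array([1.0, -2.0, 3.0])          # made-up "answer" weights
X = np.random.randn(1000, 3)                 # synthetic feature vectors φ(x)
y = X @ true_w                               # labels generated from true_w

def train_loss_gradient(w):
    # gradient of (1/n) Σ (w·φ(x) - y)^2 with respect to w
    return 2 * X.T @ (X @ w - y) / len(y)

# Batch gradient descent: each step sums over the whole dataset.
w = np.zeros(3)
eta = 0.1
for _ in range(200):
    w = w - eta * train_loss_gradient(w)
print("batch GD:", np.round(w, 3))

# Stochastic gradient descent: take a step for each training example.
w = np.zeros(3)
for _ in range(5):                           # a few passes over the data
    for phi_x, yi in zip(X, y):
        grad = 2 * (w @ phi_x - yi) * phi_x  # gradient of one example's squared loss
        w = w - eta * grad
print("SGD:     ", np.round(w, 3))           # both should end up close to true_w
```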

Section 1 9/28/18:
- Matrix properties:
  - (A + B)^T = A^T + B^T
  - (AB)^T = B^T A^T
- Gradients, e.g. for f(w) = (a • w + 1)^2 + b||w||^2 + w^T C w:
  - \nabla_w (a • w) = a
  - \nabla_w (w^T C w) = (C + C^T) w
  - \nabla_w ||w||^2 = 2w

Lecture 3 10/1/18:
- 2 big components of predictions: w and \phi(x)
  - score = w • \phi(x)
- We've already seen how to optimize w given a fixed \phi(x)
- But how do we get a good \phi(x)? Feature extraction
- Feature templates:
  - Groups of features all computed in a similar way
  - Ex: length greater than _____, contains character ___, etc.
  - An actual feature would be one of these filled in
    - Ex: feature name = "contains char 'a'", feature value = 1
- Feature vectors are typically sparse, so we can represent them as a map from feature name to value
  - Leave out features with value 0 to save space
  - Can use an array for dense features
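A sketch of a feature extractor in this style, reusing the email-address example from Lecture 2; the particular feature names and the dict-as-sparse-vector representation are illustrative, not the course's reference code.

```python
def extract_features(x):
    """Map a string x to a sparse feature vector, represented as a dict
    from feature name to value (features with value 0 are simply left out)."""
    phi = {}
    phi["length>10"] = 1 if len(x) > 10 else 0
    phi["contains_@"] = 1 if "@" in x else 0
    phi["ends_with_.com"] = 1 if x.endswith(".com") else 0
    phi["fraction_of_letters"] = sum(c.isalpha() for c in x) / len(x)
    return {name: value for name, value in phi.items() if value != 0}

def sparse_dot(w, phi):
    """Score w · φ(x) when both are sparse maps from feature name to value."""
    return sum(w.get(name, 0) * value for name, value in phi.items())

print(extract_features("abc@gmail.com"))
# {'length>10': 1, 'contains_@': 1, 'ends_with_.com': 1, 'fraction_of_letters': 0.846...}
```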

- Hypothesis class: the set of all possible predictors with a fixed \phi(x) and varying w
  - Represented with script F: F = {f_w : w \in \R^d}
  - For linear functions, \phi(x) = x, F = {x → w_1 x + w_2 x^2 : w_1 \in \R, w_2 = 0}
  - Quadratic functions have \phi(x) = [x, x^2], F = {x → w_1 x + w_2 x^2 : w_1 \in \R, w_2 \in \R}
  - Can even do piecewise functions too
- Feature extraction: set F based on domain knowledge; set f_w (\in F) based on learning
- Score w • \phi(x) is linear in w and linear in \phi(x), but not necessarily linear in x
  - Therefore, predictors can be expressive non-linear functions of x with non-linear decision boundaries
- Neural networks
  - Motivation: input is the positions of 2 oncoming cars x = [x_1, x_2], output is +1 (safe) or -1 (crash)
  - True function: y = sign(|x_1 - x_2| - 1)
  - Can decompose into 2 tests:
    - Car 1 far to the right of car 2: h_1 = 1[x_1 - x_2 ≥ 1]
    - Car 2 far to the right of car 1: h_2 = 1[x_2 - x_1 ≥ 1]
    - y = 1 if at least 1 of h_1 and h_2 is true
  - Rewrite these as if we do not know the true function:
    - \phi(x) = [1, x_1, x_2]
    - h_1 = 1[v_1 • \phi(x) ≥ 0]
    - h_2 = 1[v_2 • \phi(x) ≥ 0]
    - f_{V,w}(x) = sign(w_1 h_1 + w_2 h_2)
  - Joint learning has the goal of learning the hidden subproblems V = [v_1, v_2] and the combination weights w = [w_1, w_2]
  - However, the gradient of h_1 with respect to v_1 is always 0
    - Common way to get around this is the logistic/sigmoid function \sigma(z) = (1 + e^{-z})^{-1} (note that \sigma'(z) = \sigma(z)(1 - \sigma(z)))
    - Then we can redefine h_1 = \sigma(v_1 • \phi(x))
  - Then a simple neural network uses V to get from the \phi(x)_i to the h_j, and then uses w to get from the h_j's to a score
  - Feature learning here is essentially applying our linear predictors to the automatically learned features from V
  - Still have to do loss minimization:
    - min_{V,w} TrainLoss(V, w)
    - TrainLoss(V, w) = (1 / number of training points) * \sum Loss(x, y, V, w)

    - Loss(x, y, V, w) = (y - f_{V,w}(x))^2
    - f_{V,w}(x) = \sum_{j=1}^{k} w_j \sigma(v_j • \phi(x))
    - Then just compute the gradient of TrainLoss and solve
  - Visually, can graph the computation out and multiply along edges to get the gradient
- Forward/backward values
  - Forward: f_i is the value of the subexpression rooted at node i
  - Backward: g_i = d(out)/d(f_i) is how f_i influences the output
- Backpropagation
  - Forward pass: compute each f_i from the leaves to the root
  - Backward pass: compute each g_i from the root to the leaves
  - If f_i is below f_j, then g_i = d(f_j)/d(f_i) * g_j
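A sketch (not the course notebook) of the two-hidden-unit network above: a forward pass computing f_{V,w}(x) = \sum_j w_j \sigma(v_j • \phi(x)), and a hand-written backward pass for the squared loss following the forward/backward bookkeeping just described. The example inputs and weights are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # σ(z); note σ'(z) = σ(z)(1 - σ(z))

def forward(V, w, phi_x):
    z = V @ phi_x                              # one pre-activation per hidden unit
    h = sigmoid(z)                             # h_j = σ(v_j · φ(x))
    out = w @ h                                # f_{V,w}(x) = Σ_j w_j h_j
    return z, h, out

def backward(V, w, phi_x, y):
    """Gradients of Loss = (y - f_{V,w}(x))^2 with respect to w and V."""
    z, h, out = forward(V, w, phi_x)
    d_out = -2 * (y - out)                     # dLoss/d(out)
    grad_w = d_out * h                         # dLoss/dw_j = d_out * h_j
    d_h = d_out * w                            # back through the linear combination
    d_z = d_h * h * (1 - h)                    # back through σ: multiply by σ'(z)
    grad_V = np.outer(d_z, phi_x)              # dLoss/dv_j = d_z_j * φ(x)
    return grad_w, grad_V

# Car-collision setup: φ(x) = [1, x1, x2], 2 hidden units (made-up numbers).
phi_x = np.array([1.0, 0.5, 2.0])
V = np.array([[-1.0, 1.0, -1.0],
              [-1.0, -1.0, 1.0]])
w = np.array([1.0, 1.0])
print(backward(V, w, phi_x, y=1.0))
```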

Lecture 4 10/3/18:
- Rote/strawman learning algorithm: store the training data set, then when tested, return y if x is in the training data, otherwise segfault if you haven't seen it yet
- We don't want to focus too strongly on the training data; this is overfitting
- The actual goal is to perform as well as possible on future testing data
- Typically set aside a test set (fancy D_test) with examples that you do not use in training
- Terminology:
  - f* would be the perfect predictor
  - F is the hypothesis class from feature extraction
  - g is the predictor in F closest to f*
  - Approximation error is the distance between f* and g
    - Represents: how good is the hypothesis class?
  - f-hat is the learned predictor we actually use
  - Estimation error is the distance between f-hat and g
    - Represents: how good is the learned predictor relative to the hypothesis class?
  - As the hypothesis class grows (gets more complex), approximation error decreases because the set grows, but estimation error increases, because it's harder for the learned predictor to match something more complex
- To avoid overfitting we can control the size of the hypothesis class by keeping the dimension d of the data low, and keeping the norm/length of w low
- Controlling the norm: regularization (a sketch follows below)
  - Actually minimize TrainLoss(w) + (\lambda/2) ||w||^2
  - Add a regularization term to keep w relatively small
  - Can still do this with gradient descent
    - Same as gradient descent before, but you also shrink the weight vector toward 0 by an amount depending on \lambda
- Controlling the norm: early stopping
  - Do gradient descent and just stop earlier, before everything necessarily converges
  - Keeps weights lower if you initialize them to 0 originally
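A quick sketch of the regularized objective above, TrainLoss(w) + (\lambda/2)||w||^2; its gradient just adds a \lambda w shrinkage term to the usual gradient step. The data, step size, and \lambda value are made up for illustration.

```python
import numpy as np

np.random.seed(1)
X = np.random.randn(50, 5)                              # small made-up training set
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.1 * np.random.randn(50)

def regularized_gradient(w, lam):
    train_loss_grad = 2 * X.T @ (X @ w - y) / len(y)    # gradient of the squared loss
    return train_loss_grad + lam * w                    # + gradient of (λ/2)||w||^2

w = np.zeros(5)
eta, lam = 0.1, 0.5
for _ in range(300):
    w = w - eta * regularized_gradient(w, lam)          # each step also shrinks w toward 0
print(np.round(w, 3), "norm:", round(float(np.linalg.norm(w)), 3))
```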

- Hyperparameters: properties of the learning algorithm (features, regularization parameter \lambda, number of iterations T, step size \eta, etc.)
- Validation/development set: take out some of the training data to use as a test set
  - Use this smaller set to test out hyperparameters so you aren't tuning them on either the train or the test set
- Development cycle: recipe for success
  - Split data into training set, dev set, test set
  - Look at the data to get intuition
  - Repeat:
    - Implement a feature / tune hyperparameters
    - Run the learning algorithm
    - Sanity check train and dev error rates and the weights
    - Look at errors to brainstorm improvements
  - Run on the test set to get final error rates
- Unsupervised learning time!
  - No labels - only input x, no y
  - Much cheaper to obtain
  - Often involves looking at features and grouping/clustering data based on common features
  - Key idea: data has lots of rich latent structure; we want methods to discover this structure automatically
- Clustering goal: similar points in the same cluster, dissimilar points in different clusters
- K-means objective: each cluster k is represented by a centroid \mu_k \in \R^d such that each point is close to its assigned centroid
  - Loss_kmeans(z, \mu) = \sum_{i=1}^{n} ||\phi(x_i) - \mu_{z_i}||^2
  - Need to jointly choose the centroids \mu and the assignments z
  - If we know the locations of the centroids, it's easy to just choose whichever centroid is closest to each point
  - If we know what the assignments are, it's easy to place each centroid at the center of its cluster
  - But getting both at the same time is hard
  - Use alternating minimization to tackle the hard problem by solving 2 easy problems
    - Step 1: given centroids \mu_1, ..., \mu_K, assign each point to the nearest centroid
    - Step 2: set each centroid to the average of the points assigned to it
  - K-means algorithm: initialize centroids randomly, then alternate between steps 1 and 2 until no more centroids move / points switch clusters (a sketch follows below)
  - Choosing K: balance between overfitting and underfitting; can try out a few different values and see what works best
  - Guaranteed to converge to a local minimum, but not necessarily to a global minimum
    - Sometimes just run multiple times with different initializations and keep the run with the lowest loss

    - Can also deliberately start the centers farther apart so they don't interfere with one another
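A sketch of the alternating minimization above (assign points to the nearest centroid, then move each centroid to the mean of its points) on made-up 2-D data; not the course's reference implementation.

```python
import numpy as np

def kmeans(points, k, num_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]  # random init
    for _ in range(num_iters):
        # Step 1: assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([points[z == j].mean(axis=0) if np.any(z == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # stop when nothing moves
            break
        centroids = new_centroids
    return centroids, z

# Two made-up clusters, around (0, 0) and (5, 5).
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
centroids, z = kmeans(points, k=2)
print(np.round(centroids, 2))
```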

Lecture 5 10/8/18:
- Search! Ex: crossing a river with a goat/wolf/cabbage, route finding, robot motion planning, solving puzzles, machine translation
- Reflex-based models took an input x and produced a single action y
- Search problems produce an action sequence that will help you achieve your goal
- Can enumerate actions by exploring every single possible combination of actions and continuing until you reach your goal or reach an invalid state
- Definitions: s_start: start state, Actions(s): possible actions, Cost(s, a): action cost, Succ(s, a): successor state, IsEnd(s): reached an end state?
- Backtracking search is DFS: goes all the way down a certain path until it can't go any further, then goes back up until there is an unexplored spot
  - Implemented recursively (recurse on the first action, then on the second, etc.)
  - DFS basically assumes costs are 0
  - D = maximum depth, b = actions per state / branching factor
  - Memory: O(D), time: O(b^D)
- BFS assumes costs are constant across all actions
  - b = actions per state, d = depth of the solution
  - Memory: O(b^d), time: O(b^d)
  - Finds the optimal-cost solution but has much worse memory performance
- Combination (iterative deepening): do DFS for only the first 1 level, then the first 2 levels, then 3 levels, etc.
  - Reap the benefits of both: ensure an optimal solution and use less memory
  - Memory: O(d), time: O(b^d)
- Dynamic programming
  - From the current state s, Cost(s, a) to take an action to the next state s', then FutureCost(s') is the minimum cost to get from s' to the end
  - Goal of DP is to eliminate redundant substructure: calculate once and cache the results
  - Important assumption is that the graph is acyclic
- Dijkstra's / uniform cost search (a sketch follows below)
  - Explores the cheapest-cost path first
  - Tracks visited nodes, because there is no point exploring a path to a state you have already reached along a cheaper path
  - Important that all costs are ≥ 0
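A sketch of uniform cost search as described above, with a priority queue keyed by past cost and a visited set; the tiny example graph is made up.

```python
import heapq

def uniform_cost_search(start, is_end, successors):
    """successors(state) yields (action, next_state, cost) triples with cost >= 0."""
    frontier = [(0, start)]              # priority queue of (past cost, state)
    visited = set()
    while frontier:
        past_cost, state = heapq.heappop(frontier)
        if state in visited:
            continue                     # already popped via a cheaper path
        visited.add(state)
        if is_end(state):
            return past_cost
        for action, next_state, cost in successors(state):
            if next_state not in visited:
                heapq.heappush(frontier, (past_cost + cost, next_state))
    return None                          # no path to an end state

# Made-up graph: edges as {state: [(action, next_state, cost), ...]}.
graph = {
    "A": [("walk", "B", 1), ("bus", "C", 5)],
    "B": [("walk", "C", 1), ("walk", "D", 6)],
    "C": [("walk", "D", 2)],
    "D": [],
}
print(uniform_cost_search("A", lambda s: s == "D", lambda s: graph[s]))  # 4
```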

Lecture 6 10/10/18:
- State: a summary of all past actions sufficient to choose future actions optimally
- Search knows the costs of different actions and produces an optimal sequence of actions
- Learning does not know the costs; it uses observed sequences of actions to infer the costs
  - Inverse problem of search
- Tweaking costs
  - Look at a person's actions and assume they are ideal for that person
  - Start with some costs, then tweak them until the minimum-cost paths match what the person's actions were
  - Assumption: the cost depends only on the action, Cost(s, a) = w[a], and the total path cost is the sum of the action costs
- Structured perceptron algorithm:
  - Initialize w[a] ← 0 for all actions
  - For each iteration t = 1, ..., T:
    - For each training example (x, y) \in fancy D_train:
      - Compute the minimum cost path y' given our current costs w
      - For each action a \in y: w[a] ← w[a] - 1
      - For each action a \in y': w[a] ← w[a] + 1
  - If there is a set of weights that actually gets 0 loss on the training examples, this algorithm is guaranteed to converge to that set of weights (or a scaled version of it)
  - Can generalize to base the costs on features too
- A* search (a sketch follows below)
  - Uniform cost search still goes out equally in all directions (among paths that seem cheap)
  - A* prioritizes paths that are cheap and going in the right direction (according to some heuristic)
  - Heuristic function h(s) is an estimate of FutureCost(s)
    - Can be the straight-line distance to the goal or something like that
  - Explore paths in order of PastCost(s) + h(s)
    - Dijkstra's/UCS essentially has h(s) = 0
  - Can redefine the edge costs so that A* is essentially UCS on the modified costs
    - Path A --> B --> C --> D --> E (costs c_1, c_2, c_3, c_4)
    - Transform costs to c_1' = c_1 - h(A) + h(B), etc.
    - Intermediate h(s) terms cancel out
      - Except for the initial -h(A), but every path from A has that, so it doesn't change which path is best
    - Basically just adding a penalty for going in the wrong direction
  - Not just any heuristic will work; it has to somewhat approximate FutureCost
    - Heuristic h needs to be consistent: Cost'(s, a) = Cost(s, a) - h(s) + h(Succ(s, a)) ≥ 0, and h(s_end) = 0
    - In other words, h can never decrease by more than the edge cost along any edge (otherwise it yields a negative adjusted cost)
  - A* with a consistent heuristic always finds a minimum cost path
  - A* explores all states s satisfying PastCost(s) ≤ PastCost(s_end) - h(s) (pretty efficient!)
  - Distortion: A* distorts edge costs to favor end states
  - Admissibility: heuristic h(s) is admissible if h(s) ≤ FutureCost(s)
    - If h(s) is consistent, then it is also admissible
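A sketch of A* obtained by changing the frontier ordering of uniform cost search to PastCost(s) + h(s); the grid example and its Manhattan-distance heuristic (consistent for unit-cost 4-directional moves) are made up.

```python
import heapq

def astar(start, is_end, successors, h):
    """Explore states in order of PastCost(s) + h(s); h should be consistent."""
    frontier = [(h(start), 0, start)]        # (priority, past cost, state)
    visited = set()
    while frontier:
        _, past_cost, state = heapq.heappop(frontier)
        if state in visited:
            continue
        visited.add(state)
        if is_end(state):
            return past_cost
        for next_state, cost in successors(state):
            if next_state not in visited:
                heapq.heappush(frontier, (past_cost + cost + h(next_state),
                                          past_cost + cost, next_state))
    return None

# Made-up example: move on a 5x5 grid from (0, 0) to (4, 4) with unit-cost moves.
goal = (4, 4)

def successors(state):
    r, c = state
    moves = [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
    return [((nr, nc), 1) for nr, nc in moves if 0 <= nr < 5 and 0 <= nc < 5]

def manhattan(state):
    return abs(goal[0] - state[0]) + abs(goal[1] - state[1])   # never overestimates

print(astar((0, 0), lambda s: s == goal, successors, manhattan))  # 8
```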

- Relaxation: relax / get rid of constraints and use heuristics to make things easier
  - Go through the search process with the relaxed constraints
  - Compute the relaxed FutureCost_rel(s) for all locations, using the relaxed version as a heuristic for the constrained problem
  - Every cost in the relaxed search problem must be ≤ the corresponding cost in the original problem
  - Therefore the true FutureCost_rel you compute for the relaxed problem is a consistent heuristic when you use it to estimate FutureCost in the constrained version
  - If we have multiple heuristics, we can combine them by taking the max of them

Lecture 7 10/15/18:
- Search problems gave deterministic Succ functions for where you would definitely end up if you took a certain action from a certain state
- In the real world, things can go wrong, and if we take a certain action from a certain state, we could randomly end up in any of several states
- Markov Decision Process overview:
  - Ex: dice game - for each round r = 1, 2, ...
    - Choose stay or quit
    - If quit, get $10 and end the game
    - If stay, get $4 and roll a 6-sided die; end if it shows 1 or 2, otherwise continue
  - Can represent the dice game as a graph like in search problems, where actions take you to states
    - Have additional chance nodes with multiple edges out of them, each with a probability and a payoff/reward
  - Same definitions we had before for search problems, except Succ(s, a) becomes T(s, a, s') and Cost(s, a) becomes Reward(s, a, s')
  - Transition probabilities T(s, a, s') specify the probability of ending up in state s' if action a is taken from state s
    - For a given s and a, the probabilities T(s, a, s') must sum to 1
    - Successors are all s' s.t. T(s, a, s') > 0
- What is a solution to an MDP?
  - Policy (\pi): a mapping from each state s to an action a that specifies how you should act from each state
  - Utility: following a policy yields a random path; the utility is the sum of rewards along that path (can vary from run to run based on random chance)
  - Value: expected utility, averaging over the randomness
- Discounting
  - An MDP has a discount factor \gamma; the utility is actually the discounted sum of rewards: u = r_1 + \gamma r_2 + \gamma^2 r_3 + ...
  - Represents discounting rewards that require continuing to take more actions
  - \gamma = 1: no discounting; \gamma = 0: only take r_1
- Value of a policy: V_\pi(s), the expected utility of following policy \pi from state s... (a sketch of evaluating a policy for the dice game follows below)
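A sketch of evaluating the "always stay" policy for the dice game above by repeatedly applying V(in) = \sum_{s'} T(in, stay, s')[Reward(in, stay, s') + \gamma V(s')]; with \gamma = 1, staying pays $4 and continues with probability 2/3, so V(in) converges to 12, versus 10 for quitting.

```python
# Dice game from lecture: "stay" pays $4 and the game continues with probability 2/3
# (die shows 3-6) or ends with probability 1/3; "quit" pays $10 and ends the game.
# Iterative evaluation of the "always stay" policy with gamma = 1.

gamma = 1.0

# Transitions for the "stay" action from the in-game state:
# (probability, reward, next_state); "end" is absorbing with value 0.
stay_transitions = [(2.0 / 3.0, 4.0, "in"), (1.0 / 3.0, 4.0, "end")]

V = {"in": 0.0, "end": 0.0}
for _ in range(100):
    V["in"] = sum(p * (r + gamma * V[s_next]) for p, r, s_next in stay_transitions)
print(round(V["in"], 3))   # converges to 12.0, so "stay" beats "quit" (value 10)
```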

