COMPSCI 367 2017 Assignment 2 - Solutions

Department of Computer Science
COMPSCI 367 Artificial Intelligence

Assignment 2

due October 1, 2017

Note: Please create a file A2.pdf containing your answers to ALL questions below. In the pdf file, clearly indicate the solution to each question. Submit the file via the CS Assignment Dropbox by 23:59 on the due date. Late assignments will not be marked. The total mark for this assignment is 40. This assignment counts towards 10% of your final mark. The work done on this assignment must be your own work. Think carefully about any problems you come across, and try to solve them yourself before you ask anyone else for help. Under no circumstances should you work together with another student on any code used in this assignment.

1. We have 2 opaque boxes, each containing 3 balls. One box has 2 red balls and 1 blue ball, the other has 1 red ball and 2 blue balls. You pick a box at random and then pick one of the balls in that box at random. When you look at the ball it is red.
(a) What is the probability that the remaining two balls in this box have the same colour?
(b) You then randomly pick a ball from the remaining two balls in this box, and it turns out to be blue. What is the probability that the last ball in this box is also blue? Show working.
(Hint: You may use Bayes' theorem to solve the problems.) (6 marks)

Answer. Let A be the hypothesis "you picked the box with 2 red balls and 1 blue ball". Let B be the hypothesis "the first picked ball is red". Let C be the hypothesis "the second picked ball is blue".

(a) We know that P(A) = 1/2, P(B | A) = 2/3, and P(B | ¬A) = 1/3. We have

P(A | B) = P(A ∧ B) / P(B)
         = P(B | A) · P(A) / (P(B | A) · P(A) + P(B | ¬A) · P(¬A))
         = (1/3) / (1/3 + 1/6)
         = 2/3

The remaining two balls have the same colour exactly when we picked the box with 1 red ball and 2 blue balls, i.e. when ¬A holds. Therefore the probability that the remaining two balls have the same colour is P(¬A | B) = 1 - 2/3 = 1/3, roughly 33.3%.

(b) We know that P(C | A ∧ B) = 1/2, and

P(C | B) = P(C ∧ A | B) + P(C ∧ ¬A | B)
         = P(C | A ∧ B) × P(A | B) + P(C | ¬A ∧ B) × P(¬A | B)
         = 1/2 × 2/3 + 1 × 1/3
         = 2/3


Then we have

P(A | B ∧ C) = P(C | A ∧ B) × P(A | B) / P(C | B)
             = (1/2 × 2/3) / (2/3)
             = 1/2

The last ball is blue exactly when we picked the box with 1 red ball and 2 blue balls, i.e. when ¬A holds, and P(¬A | B ∧ C) = 1 - 1/2 = 1/2. This means that the probability that the last ball is blue is 50%.
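As a quick sanity check on this working, the following Python sketch (not part of the official solution; the variable names are my own) enumerates the experiment exactly and reproduces the answers 1/3 and 1/2.

from fractions import Fraction
from itertools import permutations

# Pick a box uniformly at random, then draw its three balls in a uniformly
# random order; every labelled ordering therefore has probability 1/2 * 1/6.
boxes = [["R", "R", "B"],   # box A: 2 red, 1 blue
         ["R", "B", "B"]]   # the other box: 1 red, 2 blue

outcomes = []               # (ordering, probability weight)
for balls in boxes:
    for order in permutations(balls):          # 6 equally likely orderings
        outcomes.append((order, Fraction(1, 2) * Fraction(1, 6)))

# (a) P(remaining two balls share a colour | first ball is red)
first_red = [(o, w) for o, w in outcomes if o[0] == "R"]
p_a = sum(w for o, w in first_red if o[1] == o[2]) / sum(w for _, w in first_red)
print("(a)", p_a)   # 1/3

# (b) P(last ball is blue | first ball red, second ball blue)
red_blue = [(o, w) for o, w in first_red if o[1] == "B"]
p_b = sum(w for o, w in red_blue if o[2] == "B") / sum(w for _, w in red_blue)
print("(b)", p_b)   # 1/2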

2. Calculate the Shannon entropy for each of the following: (6 marks)

(a) Pixel values in a grey-scale image whose possible grey values are all the integers from 0 to 255 with uniform probability.
(b) Gender in a tri-sexual insect population whose three genders occur with probabilities 1/4, 1/4, and 1/2, respectively.
(c) A feature X with domain {1, 2, 3, 4, . . .} where the probability of X = i is 2^(-i).

Answer.
(a) 256 × (1/256 · log2 256) = 8
(b) -(2 × 1/4 · log2 1/4 + 1/2 · log2 1/2) = 3/2
(c) H(X) = -Σ_{i=1}^∞ 2^(-i) log2 2^(-i) = Σ_{i=1}^∞ i · 2^(-i).
Let S_n = Σ_{i=1}^n i · 2^(-i). Then S_n - (1/2)S_n = 2^(-1) + 2^(-2) + · · · + 2^(-n) - n · 2^(-(n+1)) = 1 - (n + 2) · 2^(-(n+1)), so S_n = 2 - (n + 2) · 2^(-n). Thus we have H(X) = lim_{n→∞} S_n = lim_{n→∞} (2 - (n + 2) · 2^(-n)) = 2.
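These values can also be checked numerically; the short sketch below (illustrative only, not required by the question) computes each entropy directly, truncating the infinite sum in (c) at a point where the tail is negligible.

import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# (a) 256 equiprobable grey levels
print(entropy([1 / 256] * 256))                      # 8.0

# (b) three genders with probabilities 1/4, 1/4, 1/2
print(entropy([0.25, 0.25, 0.5]))                    # 1.5

# (c) P(X = i) = 2^-i, truncated at i = 60 where the tail is negligible
print(entropy([2.0 ** -i for i in range(1, 61)]))    # approximately 2.0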

3. Suppose we run an online news aggregation platform which observes a person's habits of reading online news articles in order to recommend other articles the user might want to read. Suppose that we have characterised each online article by whether it mentions celebrity, whether it features sports, whether it features politics, and whether it features technology. Suppose we are given the examples below about whether a specific user reads various online articles. We want to use this data set to train a model that predicts which article this person would like to read based on the mentioned features.

Example  Celebrity  Sports  Politics  Technology  Reads
e1       false      false   true      true        true
e2       true       true    false     true        false
e3       false      false   false     false       false
e4       true       false   false     false       true
e5       true       false   false     true        false
e6       false      true    false     false       false
e7       false      true    true      true        true
e8       false      true    true      false       false
e9       true       true    true      false       true
e10      false      false   true      false       false
e11      true       true    false     false       true
e12      true       false   true      false       true
e13      false      false   false     true        false

(a) Suppose you would like to use decision tree learning for this task. Apply the TDIDT algorithm to learn a decision tree. Stop extending a branch of the decision tree when all remaining training examples have the same target feature. Demonstrate how you compute the information gain for the features at each node. Draw the resulting decision tree. (7 marks)

Answer. Here is the working for deciding the root:
• H(Reads) = -(6/13 log2 6/13 + 7/13 log2 7/13) = 0.9957
• H(Reads | Celebrity = true) = -(4/6 log2 4/6 + 2/6 log2 2/6) = 0.9183
  H(Reads | Celebrity = false) = -(2/7 log2 2/7 + 5/7 log2 5/7) = 0.8631
  So I(Reads, Celebrity) = H(Reads) - (6/13 · H(Reads | Celebrity = true) + 7/13 · H(Reads | Celebrity = false)) = 0.1071
• H(Reads | Sports = true) = -(3/6 log2 3/6 + 3/6 log2 3/6) = 1
  H(Reads | Sports = false) = -(3/7 log2 3/7 + 4/7 log2 4/7) = 0.9852
  I(Reads, Sports) = H(Reads) - (6/13 · H(Reads | Sports = true) + 7/13 · H(Reads | Sports = false)) = 0.0037
• H(Reads | Politics = true) = -(4/6 log2 4/6 + 2/6 log2 2/6) = 0.9183
  H(Reads | Politics = false) = -(2/7 log2 2/7 + 5/7 log2 5/7) = 0.8631
  I(Reads, Politics) = H(Reads) - (6/13 · H(Reads | Politics = true) + 7/13 · H(Reads | Politics = false)) = 0.1071
• H(Reads | Technology = true) = -(2/5 log2 2/5 + 3/5 log2 3/5) = 0.9710
  H(Reads | Technology = false) = -(4/8 log2 4/8 + 4/8 log2 4/8) = 1
  I(Reads, Technology) = H(Reads) - (5/13 · H(Reads | Technology = true) + 8/13 · H(Reads | Technology = false)) = 0.0069

We thus choose a feature that gives the highest information gain; Celebrity and Politics tie, and we choose Celebrity. We then divide the data set into two parts, those with Celebrity = true and those with Celebrity = false, and repeat the process on each part. Continuing in this way (at Celebrity = true the feature Technology separates the remaining examples perfectly; at Celebrity = false we split on Technology and then on Politics), we obtain the following decision tree:

Celebrity?
  true:  Technology?
           true:  Reads = false
           false: Reads = true
  false: Technology?
           true:  Politics?
                    true:  Reads = true
                    false: Reads = false
           false: Reads = false
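The entropies and gains above can be reproduced with a short Python sketch (illustrative only; the data set is hard-coded from the table and the helper names are my own, not anything from the course):

from math import log2

FEATURES = ["Celebrity", "Sports", "Politics", "Technology"]
DATA = [  # (Celebrity, Sports, Politics, Technology, Reads)
    (False, False, True,  True,  True),   # e1
    (True,  True,  False, True,  False),  # e2
    (False, False, False, False, False),  # e3
    (True,  False, False, False, True),   # e4
    (True,  False, False, True,  False),  # e5
    (False, True,  False, False, False),  # e6
    (False, True,  True,  True,  True),   # e7
    (False, True,  True,  False, False),  # e8
    (True,  True,  True,  False, True),   # e9
    (False, False, True,  False, False),  # e10
    (True,  True,  False, False, True),   # e11
    (True,  False, True,  False, True),   # e12
    (False, False, False, True,  False),  # e13
]

def entropy(examples):
    """Entropy (in bits) of the Reads label over a list of examples."""
    if not examples:
        return 0.0
    p = sum(e[-1] for e in examples) / len(examples)
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

def information_gain(examples, f):
    """Gain of splitting the examples on feature index f."""
    true_part = [e for e in examples if e[f]]
    false_part = [e for e in examples if not e[f]]
    remainder = (len(true_part) * entropy(true_part) +
                 len(false_part) * entropy(false_part)) / len(examples)
    return entropy(examples) - remainder

for i, name in enumerate(FEATURES):
    print(f"I(Reads, {name}) = {information_gain(DATA, i):.4f}")
# Celebrity 0.1071, Sports 0.0037, Politics 0.1071, Technology 0.0069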

(b) Suppose you would like to use naive Bayesian learning for this task. Apply the naive Bayesian learning algorithm to approximate P(Reads = true), P(X | Reads = true), and P(X | Reads = false) for X ∈ {Celebrity, Sports, Politics, Technology} with Laplace smoothing. Then, given a test example (false, true, false, true), use the resulting naive Bayesian classifier to estimate the probability that the person would read this article. (6 marks)


Answer. We apply Laplace smoothing in this question (it would also be acceptable not to apply it). With Laplace smoothing, P(Reads = true) = (6 + 1)/(13 + 2) = 7/15, and for the test example the relevant conditional probabilities are P(Celebrity = false | Reads = true) = (2 + 1)/(6 + 2) = 3/8, P(Sports = true | Reads = true) = (3 + 1)/(6 + 2) = 1/2, P(Politics = false | Reads = true) = (2 + 1)/(6 + 2) = 3/8, and P(Technology = true | Reads = true) = (2 + 1)/(6 + 2) = 3/8. Given the test example (false, true, false, true), we compute the probability that the person would read this article:

P(Reads = true | Celebrity = false ∧ Sports = true ∧ Politics = false ∧ Technology = true)
∝ P(Celebrity = false | Reads = true) × P(Sports = true | Reads = true) × P(Politics = false | Reads = true) × P(Technology = true | Reads = true) × P(Reads = true)
= 3/8 × 1/2 × 3/8 × 3/8 × 7/15 = 63/5120.

(To turn this score into an actual probability it is normalised against the corresponding product for Reads = false.)
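The smoothed estimates and the final score can be checked with the sketch below (again illustrative, not part of the official solution; it reuses the DATA list defined in the earlier decision-tree sketch and also prints the normalised probability, roughly 0.21):

from fractions import Fraction

def smoothed(count, total):
    """Laplace-smoothed estimate for a Boolean outcome."""
    return Fraction(count + 1, total + 2)

def class_score(x, label):
    """Unnormalised naive Bayes score for Reads = label, using DATA from the previous sketch."""
    group = [e for e in DATA if e[-1] == label]
    score = smoothed(len(group), len(DATA))            # prior P(Reads = label)
    for f, value in enumerate(x):
        matches = sum(1 for e in group if e[f] == value)
        score *= smoothed(matches, len(group))         # P(feature value | Reads = label)
    return score

x = (False, True, False, True)          # the test example
pos, neg = class_score(x, True), class_score(x, False)
print(pos)                # 63/5120, the unnormalised score from the solution
print(pos / (pos + neg))  # normalised probability the article is read (about 0.21)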

4. Iris, a flowering plant, can be classified into different species by the size of its sepals and petals. Some classified examples are given as follows:

Example  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
e1       5.1           3.5          1.4           0.2          Setosa
e2       4.9           3.0          1.4           0.2          Setosa
e3       4.7           3.2          1.3           0.2          Setosa
e4       4.6           3.1          1.5           0.2          Setosa
e5       6.1           2.8          4.0           1.3          Versicolor
e6       6.3           2.5          4.9           1.5          Versicolor
e7       6.1           2.8          4.7           1.2          Versicolor
e8       6.4           2.9          4.3           1.3          Versicolor

Show the detailed steps of the perceptron learning algorithm starting with the weight vector (w0, w1, w2, w3, w4) = (1, 0, 0, 0, 0) until it has been trained on every input once. You should apply the examples in the given order. For each step, write down the applied example, the classification result and the update of the weight vector. (6 marks)

Answer. The following table illustrates the running of the algorithm for one round (class 0 is Setosa, class 1 is Versicolor, and p.class is the predicted class).

Example  current w                    example x                 sum     class  p.class  update
e1       (1, 0, 0, 0, 0)              (1, 5.1, 3.5, 1.4, 0.2)     1.00  0      1        -(1, 5.1, 3.5, 1.4, 0.2)
e2       (0, -5.1, -3.5, -1.4, -0.2)  (1, 4.9, 3.0, 1.4, 0.2)   -37.49  0      0        no update
e3       (0, -5.1, -3.5, -1.4, -0.2)  (1, 4.7, 3.2, 1.3, 0.2)   -37.03  0      0        no update
e4       (0, -5.1, -3.5, -1.4, -0.2)  (1, 4.6, 3.1, 1.5, 0.2)   -36.45  0      0        no update
e5       (0, -5.1, -3.5, -1.4, -0.2)  (1, 6.1, 2.8, 4.0, 1.3)   -46.77  1      0        +(1, 6.1, 2.8, 4.0, 1.3)
e6       (1, 1, -0.7, 2.6, 1.1)       (1, 6.3, 2.5, 4.9, 1.5)    19.94  1      1        no update
e7       (1, 1, -0.7, 2.6, 1.1)       (1, 6.1, 2.8, 4.7, 1.2)    18.68  1      1        no update
e8       (1, 1, -0.7, 2.6, 1.1)       (1, 6.4, 2.9, 4.3, 1.3)    17.98  1      1        no update

5. Download the data set data1.arff and run both naive Bayesian learning and linear classification using Weka on this data set (with class as the only target feature). You can tweak the parameters to try to obtain a better accuracy. Evaluate the performance of both algorithms.
• Describe the steps you take in your experiments. Describe the results obtained from your experiments. What is the output of each algorithm?
• Which algorithm gives a significantly better result?
• Explain the possible reasons behind the performance of each algorithm. (5 marks)


Answer. Experimental steps: choose data1.arff in the Explorer and classify the data using NaiveBayes for naive Bayesian classification and SGD for linear classification. You need to describe the settings used for evaluating the classification results, e.g. k-fold cross-validation or a percentage split; it is not good to use only "Use training set" as the test option. The results should be presented in detail, including the confusion matrix and the errors. Linear classification should work significantly better than naive Bayesian classification. This is because the positive and negative instances are clearly separated by a downward-sloping straight line, as can be seen by plotting the data set in the x-y plane.

Also observe that the input features x and y are not independent given the output class. For example, given that the sample is positive (class = +), the probability that x is a high value (x > 1.21) given that y is a high value (y > 0.75) is different from the probability that x > 1.21, i.e. P(x > 1.21 | y > 0.75 ∧ class = +) ≠ P(x > 1.21 | class = +). Thus the naive Bayesian classifier would not give a good performance on this data set.

6. Download the data set data2.arff and run the kNN algorithm on this data set (with class as the only target feature). Run the algorithm with different values of k and observe the results. Explain the possible reasons behind the performance of the algorithm. (4 marks)

Answer. When the samples are visualised, it is easy to see that for any sample, the closest instance (by Euclidean distance) has a different class. Thus the 1NN algorithm will misclassify all instances (note that this assumes we do not choose the "Use training set" option, which would in fact be meaningless here). For any sample x, the k closest instances will have a majority class different from x's class when k is 3, 5, or 9. Thus the corresponding kNN algorithm would also leave 100% or close to 100% of the instances incorrectly classified.
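The effect can be illustrated with a small sketch. Since data2.arff is not reproduced here, the sketch uses a hypothetical checkerboard grid with the same key property that every point's nearest neighbours belong to the other class; the leave-one-out error for the small odd values of k tried below comes out at or very close to 100%.

from collections import Counter

# Toy checkerboard data: class alternates between adjacent grid points,
# so each point's nearest neighbours all have the opposite class.
points = [(x, y, (x + y) % 2) for x in range(10) for y in range(10)]

def knn_predict(query, train, k):
    """Majority class among the k nearest training points (Euclidean distance)."""
    nearest = sorted(train, key=lambda p: (p[0] - query[0]) ** 2 +
                                          (p[1] - query[1]) ** 2)[:k]
    return Counter(p[2] for p in nearest).most_common(1)[0][0]

for k in (1, 3, 5):
    # leave-one-out evaluation: classify each point using all the others
    errors = sum(knn_predict(p, [q for q in points if q != p], k) != p[2]
                 for p in points)
    print(f"k={k}: leave-one-out error = {errors / len(points):.0%}")
# prints error rates at or very close to 100% for each k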


Marking Schedule

1. Question 1.
   1 mark for naming the propositions.
   2 marks for a correct answer to (a).
   3 marks for a correct answer to (b).
   Note that students may apply different working and still reach the same correct result.

2. Question 2.
   2 marks for each of (a), (b) and (c): for each part, 1 mark for the working and 1 mark for the correct result.

3. Question 3(a). There are 7 marks. Students do not need to show all working details; it is enough to show the working for finding the root and then use English to describe the rest. As long as the working demonstrates that they understand the process, they should be given the marks. They do not need to use Laplace smoothing here, but they may; this means there is no single unique answer to this question.
   2 marks for applying the correct (conditional) entropy formula.
   2 marks for applying the correct formula for calculating information gain.
   1 mark for finding the correct feature to split on.
   2 marks for drawing the decision tree (1 mark if the tree is partial).

4. Question 3(b).
   3 marks for identifying the correct values of the conditional probabilities.
   3 marks for applying the naive Bayesian classification formula to get a correct answer. Laplace smoothing is not required.

5. Question 4.
   1 mark each for the correct result after processing each sample.

6. Question 5.
   1 mark for describing the experimental steps.
   1 mark for describing the results.
   1 mark for comparing the two algorithms.
   1 mark for explaining why linear classification should work on this data set.
   1 mark for explaining why the naive Bayesian classifier would not work well.

7. Question 6.
   2 marks for describing the experiment settings and results. Note that for NN it is not good to use the "Use training set" test option, as this would obviously give a 100% accurate result because the test samples are drawn from the training set.
   2 marks for explaining why NN and kNN do not work well on this data set.
