
Title Lab Report 11
Author Bailey Burke
Course Little Bits To Big Ideas
Institution University of Illinois at Urbana-Champaign

Summary

Lab report #11
INFO 102
Professor Ryan Cunningham



Lab Report #11

1. Procedure

For this lab, I began in Spyder. First, I opened the wine.txt file, as part of the lab’s machine learning methods would be based on this data. The data by itself was difficult to make sense of, though the lab directions stated that it consisted of 178 different wines, each with 13 identifying features. The first program I ran was tree.py. This program used the decision tree machine learning method to try to correctly predict which cultivar each wine in the data set belongs to. The results from this program were an accuracy metric, the time taken, and a text file that I converted, through the given link, into the resulting decision tree. Next, I ran the program neighbor.py, which used a K-nearest neighbors algorithm to make the same cultivar predictions; its results were an accuracy score and the time taken. Similarly, I ran the svm.py program and the neural.py program. Each had the same goal as tree.py and neighbor.py, except that the first used the support-vector machine method and the second used the neural network method. Finally, I opened the mystery.py file. This file contained code allowing the user to run different classifiers to try to reach the highest possible predictive accuracy on the mystery data set, mystery.txt. I ran this program with the given classifier, and also tried the code for the K-nearest neighbors, SVM, and decision tree approaches. After that, I tried changing the parameters within the classifier functions, which I was able to find through Google searches. After several tries, I concluded the lab.
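The lab's own tree.py source is not included here, but the workflow it describes can be sketched with scikit-learn, whose bundled wine data is the same 178-wine, 13-feature set as wine.txt. This is a hypothetical sketch, not the lab's actual code; the split, random seed, and exact metrics will differ from my results.

```python
import time
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the 178-sample, 13-feature wine data (3 cultivars), as in wine.txt
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a decision tree and record the training time, as tree.py reports
start = time.time()
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
elapsed = time.time() - start

# Accuracy on held-out wines: how often the predicted cultivar is correct
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"accuracy: {acc:.2f}, training time: {elapsed:.3f}s")
```

Swapping `DecisionTreeClassifier` for another estimator reproduces the pattern the other lab scripts (neighbor.py, svm.py, neural.py) follow: same data, same accuracy-and-time output, different learning method.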

2. Results

My first results came from the tree.py program. I ran the program two times and found similar accuracy and time metrics: each run had an accuracy rating of about .90 and took about .3 seconds. Additionally, my results yielded a text file, which could be converted through the lab’s given website into a decision tree. This tree shows how the machine learning method uses true/false criteria to work down the tree into increasingly specific groups based on the 13 given features in the wine.txt file.

My results from the neighbor.py, svm.py, and neural.py programs are all similar in that they provide the accuracy rating in correctly predicting which cultivar each wine belonged to, and the time taken to do so. I ran each 2 or 3 times to make sure the results were relatively consistent. Finally, my results from the mystery.py program show that the most accurate classifier reached .73, using the neural classifier with the parameter “tanh”, which uses the hyperbolic tangent function. The first mystery.py results show the results of each basic classifier used in the other portions of the lab, grouped in order of neural, neighbor, svm, and finally decision tree. Some were run multiple times to check for consistency. The second mystery.py results show the results from the “tanh” parameter. All lab results are below:

Tree.py results:

Dtree.txt from running tree.py:

Tree generated from dtree.txt:

Neighbor.py results:

SVM.py results:

Neural.py results:

Mystery.py results, using neural, neighbors, svm, decision tree classifiers, respectively

Most accurate mystery.py classifier and parameter: MLPClassifier(activation=“tanh”)

3. Discussion

1. What feature was most important in the decision tree classifier? What leads you to conclude that this is the most important feature? The feature that was most important in the decision tree classifier was flavonoids. I concluded this because the flavonoid feature was used to create the first two children from the entire data set. If it was true that a wine’s flavonoid value was less than or equal to 1.575, the wine went to the lower-left child; if it was false, it went to the lower-right child. It is the most important feature because it split the initial sample of 178 wines into two smaller groups of 62 and 116 wines. From there, other features were used to further classify the wines.
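The root split described above can also be read programmatically from a fitted tree. As a hedged sketch using scikit-learn's bundled wine data (the lab's tree.py internals, split, and exact threshold may differ, so the printed feature and threshold here are illustrative, not my lab's values):

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

data = load_wine()
clf = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Node 0 is the root; tree_.feature / tree_.threshold give the test applied
# there, i.e. the first (most important) split over the whole data set.
root_feature = data.feature_names[clf.tree_.feature[0]]
root_threshold = clf.tree_.threshold[0]
print(f"root split: {root_feature} <= {root_threshold:.3f}")
```

Samples satisfying the root test go to the left child and the rest to the right child, which is exactly the 62/116 split the generated tree diagram shows.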

2. The decision tree classifier can split most of the first cultivar from the rest of the wines by making a series of decisions. What are those decisions? The decision tree classifier split most of the first cultivar from the rest of the wines by making a series of true/false decisions: false that the flavonoid value is less than or equal to 1.575, false that the proline value is less than or equal to 724.5, and false that the intensity value is less than or equal to 3.46. In the case of splitting off the first cultivar, each decision was false.

3. The decision tree classifier can identify most of the third cultivar. What series of decisions would lead to a classification as the third cultivar? The decisions the decision tree classifier uses to identify most of the third cultivar are true that flavonoid value is less than or equal to 1.575, false that intensity is less than or equal to 3.825, and false that ash is less than or equal to 2.06.

4. Which of the machine learning algorithms performs the best on the wine data set? Which performs the worst? Which takes the longest time to train? Which takes the least?

The machine learning algorithm that performs best on the wine data set is the neural network algorithm. The method that takes the longest time to train uses the k-nearest neighbors algorithm, and the method that takes the least time to train uses the decision tree algorithm.
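The four-way comparison behind this answer can be sketched as a single loop, again using scikit-learn's bundled wine data as a stand-in for the lab scripts. This is an illustrative sketch, not the lab's code: the relative accuracies and training times on a given machine may rank differently from my recorded results.

```python
import time
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The four methods the lab compares, with default parameters
classifiers = {
    "decision tree": DecisionTreeClassifier(),
    "k-nearest neighbors": KNeighborsClassifier(),
    "SVM": SVC(),
    "neural network": MLPClassifier(max_iter=2000),
}

results = {}
for name, clf in classifiers.items():
    start = time.time()
    clf.fit(X_train, y_train)          # time only the training step
    elapsed = time.time() - start
    results[name] = (clf.score(X_test, y_test), elapsed)
    print(f"{name}: accuracy {results[name][0]:.2f}, time {elapsed:.3f}s")
```

Running each classifier on identical train/test splits is what makes the accuracy and time numbers directly comparable, mirroring how the lab's four scripts share the same wine.txt input.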

5. What classifiers and parameters did you try on the mystery dataset? I tried the “identity”, “logistic”, “tanh”, and “relu” parameters of the neural classifier on the mystery dataset. I also tried the “uniform” parameter with the K-nearest neighbors classifier, and the “gini” parameter with the decision tree classifier.
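The neural-classifier parameter sweep above can be sketched as a loop over `MLPClassifier`'s `activation` options. This is a hedged sketch using scikit-learn's wine data in place of mystery.txt (which isn't available here), so the printed scores will not match my mystery-set results.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in data; the lab ran this sweep on mystery.txt instead
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
# The four activation functions tried: identity, logistic, tanh, relu
for activation in ["identity", "logistic", "tanh", "relu"]:
    clf = MLPClassifier(activation=activation, max_iter=2000, random_state=0)
    clf.fit(X_train, y_train)
    scores[activation] = clf.score(X_test, y_test)
    print(f"{activation}: accuracy {scores[activation]:.2f}")
```

The same pattern applies to the other classifiers tried: `KNeighborsClassifier(weights="uniform")` and `DecisionTreeClassifier(criterion="gini")`.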

6. What was your best performance on the mystery dataset? What algorithm and parameters did you use to get this result? My best performance on the mystery dataset was an accuracy rating of .73, though it did take 83 seconds. This was yielded by the neural network algorithm with the “tanh” parameter and is included in my results.

7. Can you guess what the mystery task is? My guess is that the mystery data set is coordinates, and the mystery task is to organize the coordinates geographically. Maybe there are different geographic groups assigned numeric values.

4. Conclusions

The first thing I learned in this lab was how a decision tree algorithm works as a machine learning method. By analyzing the decision tree generated from running the tree.py file, I was able to understand how each child was created. I cross-referenced the tree with the data in wine.txt, and since wine.txt lists which cultivar each wine belongs to, I was able to follow along the tree and understand how it organized the data. This was also interesting because in lecture Ryan explained that a machine learning algorithm likely won’t have 100% accuracy, and sure enough, by following the decision tree, I could see that the tree couldn’t perfectly split the wines into their 3 cultivars.

Next, I learned about the relationships between the 4 machine learning methods used in lab. While I understood the basics of each from lecture, there was no way to really understand how they compare to each other. While I’m sure it’s dependent on the data set, this lab at least showed me that different machine learning algorithms can have varying results when it comes to correctly predicting data. In this lab, each algorithm had similar accuracy ratings, though they varied more greatly in time to train.

Finally, this lab taught me the difficulty of creating a highly accurate machine learning algorithm. I learned this through the mystery data set: I tried 10 different combinations of classifiers and parameters on it, and the highest accuracy rating I was able to yield was only .73. This is not impressive, and likely wouldn’t be acceptable if the algorithm needed to be used in real life. As Ryan explained, machine learning can be risky because it is extremely difficult to create a perfect algorithm, and my trials with the mystery data set proved this to me.
