Title | ML2020 04 - Lecture 4 Notes |
---|---|
Author | Michele Varacchi |
Course | Machine Learning For Pattern Recognition |
Institution | Università degli Studi di Parma |
Machine Learning Applications...
Introduction to Weka
Data Mining with Weka

What is Weka?
– A bird found only in New Zealand?
– A data mining workbench: the Waikato Environment for Knowledge Analysis

Machine learning algorithms for data mining tasks (ever-increasing numbers …):
• 100+ algorithms for classification
• 75 for data preprocessing
• 25 to assist with feature selection
• 20 for clustering, finding association rules, etc.
Textbook

This textbook discusses data mining, and Weka, in depth:
Data Mining: Practical Machine Learning Tools and Techniques, by Ian H. Witten, Eibe Frank and Mark A. Hall. Morgan Kaufmann, 2011.
The publisher has made available parts relevant to this course in ebook format.
Weka

WEKA lets one choose among a huge number of modules that implement the main steps needed to develop a classification/data mining application: filters, classifiers, clustering algorithms, and graphical analysis of results.

Modules can be accessed:
o using a command-line interface
o through one of the GUIs provided by Weka
o within one's own code, as functions of a Java package

Modules are managed through the Package Manager: new modules are added regularly and can be installed from it. A (beta) Python interface can also be installed.
Online courses on Weka

Basic: "Data mining with WEKA"
◦ 5 lessons of 6 modules each (7–10 minutes + exercises): about 15 hours of total engagement, including time for assignments.
◦ https://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
◦ https://www.youtube.com/watch?v=LcHw2ph6bss&list=PLm4W7_iX_v4NqPUjceOGd-OKNVO4c_cPD

Intermediate: "More data mining with WEKA"
◦ http://www.cs.waikato.ac.nz/ml/weka/mooc/moredataminingwithweka/
◦ https://www.youtube.com/watch?v=iqQn6YfyGs0&list=PLm4W7_iX_v4OMSgc8xowC2h70s-unJKCp

Advanced: "Advanced data mining with WEKA"
◦ https://weka.waikato.ac.nz/ml/weka/mooc/advanceddataminingwithweka
◦ https://www.youtube.com/watch?v=Lhw_XcGCTFg
Suggested readings/online material

First two units of the online course (6+6 modules) on Weka: they include a revision of some concepts introduced in the previous lessons.

Online on our course site:
◦ Slides associated with chapter 5 of the book "Data Mining" by Witten, Frank, and Hall (CH5_TrainTest.pdf). Very concise, but covering most of the problems related to model quality evaluation (more than we actually cover in this course).
◦ "Machine Learning with Weka" (slides) by E. Frank (weka.pdf). A short crash course on using the Weka interfaces.
Classification: Training and testing

[Diagram: training data is fed to the ML algorithm, which produces a classifier; the classifier is then applied to the test data to produce the results.]
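The training/testing pipeline in the diagram above can be sketched, independently of Weka, with a trivial majority-class baseline (the strategy of Weka's ZeroR classifier): "training" memorizes the most frequent class, and "testing" measures accuracy on held-out data. The label sequences below are made up for illustration.

```python
from collections import Counter

def train_majority(labels):
    """'Training' phase: learn the most frequent class in the training data."""
    return Counter(labels).most_common(1)[0][0]

def evaluate(model, test_labels):
    """'Testing' phase: fraction of test labels the constant prediction gets right."""
    hits = sum(1 for y in test_labels if y == model)
    return hits / len(test_labels)

# Hypothetical train/test label sequences
train = ["yes", "yes", "no", "yes", "yes", "no", "yes", "no", "yes", "yes"]
test = ["yes", "no", "no", "yes"]

model = train_majority(train)     # training data -> ML algorithm -> classifier
accuracy = evaluate(model, test)  # classifier + test data -> results
print(model, accuracy)  # yes 0.5
```

Any real classifier follows the same two-phase shape; only the "model" it produces is richer than a single constant.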
Weka
(http://www.cs.waikato.ac.nz/ml/weka/)

The phases into which a classifier's design can be divided are reflected in the structure of WEKA's Explorer:
o Data pre-processing (filtering) and representation
o Supervised Learning / Classification, or
o Unsupervised Learning / Clustering, or
o Rule-based classification
o Visualization
Dataset: attributes and instances

| # | Outlook | Temp | Humidity | Windy | Play |
|---|----------|------|----------|-------|------|
| 1 | Sunny | Hot | High | False | No |
| 2 | Sunny | Hot | High | True | No |
| 3 | Overcast | Hot | High | False | Yes |
| 4 | Rainy | Mild | High | False | Yes |
| 5 | Rainy | Cool | Normal | False | Yes |
| 6 | Rainy | Cool | Normal | True | No |
| 7 | Overcast | Cool | Normal | True | Yes |
| 8 | Sunny | Mild | High | False | No |
| 9 | Sunny | Cool | Normal | False | Yes |
| 10 | Rainy | Mild | Normal | False | Yes |
| 11 | Sunny | Mild | Normal | True | Yes |
| 12 | Overcast | Mild | High | True | Yes |
| 13 | Overcast | Hot | Normal | False | Yes |
| 14 | Rainy | Mild | High | True | No |
Data files in Weka

Data files in Weka (.arff files) have a simple format:

@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
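To see how simple the format is, here is a minimal sketch of an ARFF reader for nominal attributes only. It is a simplification: real ARFF also allows numeric, string and date attribute types, quoted values, sparse data, and comments, all of which are ignored here.

```python
def parse_arff(text):
    """Parse nominal-only ARFF text into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        low = line.lower()
        if low.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif low.startswith("@attribute"):
            # e.g. "@attribute outlook {sunny, overcast, rainy}"
            name, values = line.split(None, 2)[1:]
            attributes.append((name, [v.strip() for v in values.strip("{}").split(",")]))
        elif low.startswith("@data"):
            in_data = True
        elif in_data:
            rows.append([v.strip() for v in line.split(",")])
    return relation, attributes, rows

sample = """@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute play {yes, no}
@data
sunny,no
overcast,yes"""

relation, attrs, rows = parse_arff(sample)
```

In practice Weka's own loaders (or its Python interface) do this work; the sketch only shows that the file is a header of attribute declarations followed by comma-separated instances.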
Classification: The goal

[The same weather table as above, with the class column highlighted: given the attributes (Outlook, Temp, Humidity, Windy) of each instance, the goal is to predict the value of the class attribute Play.]
Weka
WEKA includes three GUIs:
◦ Explorer: interactive pre-processing, data classification and algorithm configuration
◦ Experimenter: development of applications as cascaded modules, and an environment for organizing multiple executions of algorithms (on different data sets or using different random seeds) with a final summary of the results
◦ Knowledge Flow: a GUI for 'visual' building of an application

as well as a command-line interface from which the different modules can be run as independent applications.
Weka control window

Performance comparisons

[Screenshots: the graphical interface vs. the command-line interface]
Visualize your data: Weka's boundary visualizer

[Scatter plot of iris.2D.arff: petallength (x axis, 1 to 6.9) vs. petalwidth (y axis, 0.1 to 2.5)]
Exploring the Explorer

Open the file weather.nominal.arff: the Explorer lists the attributes and, for the selected attribute, its values.
Attribute Selection: Information Gain
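Information gain, the attribute-selection criterion above, can be computed directly from the data: it is the entropy of the class labels minus the weighted entropy remaining after splitting on the attribute. The sketch below applies it to the Outlook column of the weather table; 0.247 bits is the standard value for this dataset.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attribute_values, labels):
    """Information gain of splitting `labels` by `attribute_values`."""
    n = len(labels)
    groups = {}
    for v, y in zip(attribute_values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the subsets produced by the split
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Outlook and Play columns from the weather table (instances 1-14)
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

print(round(info_gain(outlook, play), 3))  # 0.247
```

Outlook wins over the other attributes (Humidity ≈ 0.152, Windy ≈ 0.048, Temp ≈ 0.029 bits), which is why it becomes the root of the decision tree for this dataset.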
Decision Trees: J48 in Weka (J48 is Weka's Java implementation of the C4.5 algorithm, a successor of ID3)
More quality criteria
Quality criteria for n-ary classifiers

Given two confusion matrices for a 3-class problem — the actual predictor (left) vs. a random predictor (right), where the random predictor assigns classes according to the class priors, P(a|x) = P(a), P(b|x) = P(b), P(c|x) = P(c) — the number of successes is the sum of the entries on the diagonal (D).

The Kappa statistic

K = (D_observed − D_random) / (D_perfect − D_random)

measures the relative improvement over a random predictor: K = 1 for a perfect predictor, K = 0 for one no better than chance.
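The Kappa formula can be evaluated directly from a confusion matrix: D_random is the diagonal a random predictor with the same marginals would get in expectation, and D_perfect is the total number of instances. The 3-class matrix below uses illustrative values, not results from this course.

```python
def kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows = actual, cols = predicted)."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    d_observed = sum(confusion[i][i] for i in range(k))
    # Expected diagonal of a random predictor with the same class marginals
    row_tot = [sum(row) for row in confusion]
    col_tot = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    d_random = sum(r * c / n for r, c in zip(row_tot, col_tot))
    d_perfect = n
    return (d_observed - d_random) / (d_perfect - d_random)

# Illustrative 3-class confusion matrix: 140 of 200 instances on the diagonal
cm = [[88, 10, 2],
      [14, 40, 6],
      [18, 10, 12]]
print(round(kappa(cm), 2))  # 0.49
```

Here the classifier gets 70% of the instances right, but a random predictor with the same marginals would already get 41% in expectation, so the relative improvement is only about 0.49.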