Title | ML2020 04 - Lecture 4 Notes |
---|---|
Author | Michele Varacchi |
Course | Machine Learning For Pattern Recognition |
Institution | Università degli Studi di Parma |
Machine Learning Applications...
Introduction to Weka
Data Mining with Weka

What is Weka?
– A bird found only in New Zealand?
– A data mining workbench: the Waikato Environment for Knowledge Analysis

Machine learning algorithms for data mining tasks (ever-increasing numbers …):
• 100+ algorithms for classification
• 75 for data preprocessing
• 25 to assist with feature selection
• 20 for clustering, finding association rules, etc.
Textbook

This textbook discusses data mining, and Weka, in depth:
Data Mining: Practical Machine Learning Tools and Techniques, by Ian H. Witten, Eibe Frank and Mark A. Hall. Morgan Kaufmann, 2011.
The publisher has made available parts relevant to this course in ebook format.
Weka

WEKA lets one choose among a huge number of modules that implement the main steps needed to develop a classification/data mining application: filters, classifiers, clustering algorithms, and graphical analysis of results.

Modules can be accessed:
o using a command-line interface
o through one of the GUIs provided by Weka
o within one's own code, as functions of a Java package

Modules are managed through the Package Manager: new modules are added regularly and can be installed from it. A (beta) Python interface can also be installed.
Online courses on Weka

Basic: "Data mining with WEKA"
◦ 5 lessons of 6 modules each (7–10 minutes + exercises): about 15 hours of total engagement, including time for assignments.
◦ https://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
◦ https://www.youtube.com/watch?v=LcHw2ph6bss&list=PLm4W7_iX_v4NqPUjceOGd-OKNVO4c_cPD

Intermediate: "More data mining with WEKA"
◦ http://www.cs.waikato.ac.nz/ml/weka/mooc/moredataminingwithweka/
◦ https://www.youtube.com/watch?v=iqQn6YfyGs0&list=PLm4W7_iX_v4OMSgc8xowC2h70s-unJKCp

Advanced: "Advanced data mining with WEKA"
◦ https://weka.waikato.ac.nz/ml/weka/mooc/advanceddataminingwithweka
◦ https://www.youtube.com/watch?v=Lhw_XcGCTFg
Suggested readings/online material

First two units of the online course (6+6 modules) on Weka: they include a revision of some concepts introduced in the previous lessons.

Online on our course site:
◦ Slides associated with chapter 5 of the book "Data Mining" by Witten, Frank, and Hall (CH5_TrainTest.pdf). Very concise, but covering most of the problems related to model quality evaluation (more than we actually cover in this course).
◦ "Machine Learning with Weka" (slides) by E. Frank (weka.pdf). A short crash course on using the Weka interfaces.
Classification: Training and testing

[Diagram: training data is fed to the ML algorithm, which produces a classifier; the classifier is then applied to the test data to produce the results.]
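The training/testing pipeline in the diagram above can be sketched, independently of Weka, with a trivial majority-class baseline (the strategy of Weka's ZeroR classifier): "training" memorizes the most frequent class, and "testing" measures accuracy on held-out data. The label sequences below are made up for illustration.

```python
from collections import Counter

def train_majority(labels):
    """'Training' phase: learn the most frequent class in the training data."""
    return Counter(labels).most_common(1)[0][0]

def evaluate(model, test_labels):
    """'Testing' phase: fraction of test labels the constant prediction gets right."""
    hits = sum(1 for y in test_labels if y == model)
    return hits / len(test_labels)

# Hypothetical train/test label sequences
train = ["yes", "yes", "no", "yes", "yes", "no", "yes", "no", "yes", "yes"]
test = ["yes", "no", "no", "yes"]

model = train_majority(train)     # training data -> ML algorithm -> classifier
accuracy = evaluate(model, test)  # classifier + test data -> results
print(model, accuracy)  # yes 0.5
```

Any real classifier follows the same two-phase shape; only the "model" it produces is richer than a single constant.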
Weka
(http://www.cs.waikato.ac.nz/ml/weka/)

The phases into which a classifier's design can be divided are reflected in the structure of WEKA's Explorer:
o Data pre-processing (filtering) and representation
o Supervised Learning / Classification, or
o Unsupervised Learning / Clustering, or
o Rule-based classification
o Visualization
Dataset: attributes and instances

| # | Outlook | Temp | Humidity | Windy | Play |
|---|----------|------|----------|-------|------|
| 1 | Sunny | Hot | High | False | No |
| 2 | Sunny | Hot | High | True | No |
| 3 | Overcast | Hot | High | False | Yes |
| 4 | Rainy | Mild | High | False | Yes |
| 5 | Rainy | Cool | Normal | False | Yes |
| 6 | Rainy | Cool | Normal | True | No |
| 7 | Overcast | Cool | Normal | True | Yes |
| 8 | Sunny | Mild | High | False | No |
| 9 | Sunny | Cool | Normal | False | Yes |
| 10 | Rainy | Mild | Normal | False | Yes |
| 11 | Sunny | Mild | Normal | True | Yes |
| 12 | Overcast | Mild | High | True | Yes |
| 13 | Overcast | Hot | Normal | False | Yes |
| 14 | Rainy | Mild | High | True | No |
Data files in Weka

Data files in Weka (.arff files) have a simple format:

@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
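To see how simple the format is, here is a minimal sketch of an ARFF reader for nominal attributes only. It is a simplification: real ARFF also allows numeric, string and date attribute types, quoted values, sparse data, and comments, all of which are ignored here.

```python
def parse_arff(text):
    """Parse nominal-only ARFF text into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        low = line.lower()
        if low.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif low.startswith("@attribute"):
            # e.g. "@attribute outlook {sunny, overcast, rainy}"
            name, values = line.split(None, 2)[1:]
            attributes.append((name, [v.strip() for v in values.strip("{}").split(",")]))
        elif low.startswith("@data"):
            in_data = True
        elif in_data:
            rows.append([v.strip() for v in line.split(",")])
    return relation, attributes, rows

sample = """@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute play {yes, no}
@data
sunny,no
overcast,yes"""

relation, attrs, rows = parse_arff(sample)
```

In practice Weka's own loaders (or its Python interface) do this work; the sketch only shows that the file is a header of attribute declarations followed by comma-separated instances.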
Classification: The goal

[The same weather table as above, with the class column highlighted: given the attributes (Outlook, Temp, Humidity, Windy) of each instance, the goal is to predict the value of the class attribute Play.]
Weka
WEKA includes three GUIs:
◦ Explorer: interactive pre-processing, data classification and algorithm configuration
◦ Experimenter: development of applications as cascaded modules, and an environment for organizing multiple executions of algorithms (on different data sets or using different random seeds) with a final summary of the results
◦ Knowledge Flow: a GUI for 'visual' building of an application

as well as a command-line interface from which the different modules can be run as independent applications.
Weka control window

Performance comparisons

[Screenshots: the graphical interface vs. the command-line interface]
Visualize your data: Weka's boundary visualizer

[Scatter plot of iris.2D.arff: petallength (x axis, 1 to 6.9) vs. petalwidth (y axis, 0.1 to 2.5)]
Exploring the Explorer

Open the file weather.nominal.arff: the Explorer lists the attributes and, for the selected attribute, its values.
Attribute Selection: Information Gain
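Information gain, the attribute-selection criterion above, can be computed directly from the data: it is the entropy of the class labels minus the weighted entropy remaining after splitting on the attribute. The sketch below applies it to the Outlook column of the weather table; 0.247 bits is the standard value for this dataset.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attribute_values, labels):
    """Information gain of splitting `labels` by `attribute_values`."""
    n = len(labels)
    groups = {}
    for v, y in zip(attribute_values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the subsets produced by the split
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Outlook and Play columns from the weather table (instances 1-14)
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

print(round(info_gain(outlook, play), 3))  # 0.247
```

Outlook wins over the other attributes (Humidity ≈ 0.152, Windy ≈ 0.048, Temp ≈ 0.029 bits), which is why it becomes the root of the decision tree for this dataset.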
Decision Trees: J48 in Weka (J48 is Weka's Java implementation of the C4.5 algorithm, a successor of ID3)
More quality criteria
Quality criteria for n-ary classifiers

Given two confusion matrices for a 3-class problem — the actual predictor (left) vs. a random predictor (right), where the random predictor assigns classes according to the class priors, P(a|x) = P(a), P(b|x) = P(b), P(c|x) = P(c) — the number of successes is the sum of the entries on the diagonal (D).

The Kappa statistic

K = (D_observed − D_random) / (D_perfect − D_random)

measures the relative improvement over a random predictor: K = 1 for a perfect predictor, K = 0 for one no better than chance.
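The Kappa formula can be evaluated directly from a confusion matrix: D_random is the diagonal a random predictor with the same marginals would get in expectation, and D_perfect is the total number of instances. The 3-class matrix below uses illustrative values, not results from this course.

```python
def kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows = actual, cols = predicted)."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    d_observed = sum(confusion[i][i] for i in range(k))
    # Expected diagonal of a random predictor with the same class marginals
    row_tot = [sum(row) for row in confusion]
    col_tot = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    d_random = sum(r * c / n for r, c in zip(row_tot, col_tot))
    d_perfect = n
    return (d_observed - d_random) / (d_perfect - d_random)

# Illustrative 3-class confusion matrix: 140 of 200 instances on the diagonal
cm = [[88, 10, 2],
      [14, 40, 6],
      [18, 10, 12]]
print(round(kappa(cm), 2))  # 0.49
```

Here the classifier gets 70% of the instances right, but a random predictor with the same marginals would already get 41% in expectation, so the relative improvement is only about 0.49.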