BI-lect 9 - Lecture notes 9

Title	BI-lect 9 - Lecture notes 9
Author	Rameez Riaz
Course	Business intelligence e data mining
Institution	Politecnico di Milano
Pages	21
File Size	1.6 MB
File Type	PDF

Summary

Carlo Vercellis
Old "Business Intelligence" Course Slides...

Description


Regression

Carlo Vercellis
Politecnico di Milano, School of Management - Campus Bovisa Sud
via Lambruschini 4b, 20156 Milano
www.door.polimi.it

Regression models

A dataset $D$ contains $m$ observations and $n+1$ attributes: $n$ independent attributes (explanatory variables, predictors) and 1 dependent attribute (target, response).

Observations are points in an $n$-dimensional space; the target attribute is denoted as $y$.

$X$ is the $m \times n$ matrix of data, and $y$ is the $m$-dimensional target vector.

$X_1, \dots, X_n$ and $Y$ are random variables.

Business Intelligence © Carlo Vercellis


Regression models

beware of spurious correlation

the hypothesis space should be simple, for example linear, quadratic, or exponential functions

Accuracy vs. generalization

[Figure: diagram with the elements Model, Past data, New data]

Simple linear regression


deterministic model

probabilistic model
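The equations for these two models did not survive extraction. A sketch in the notation used later in these slides (slope $w$, intercept $b$) is given here; the noise symbol $\varepsilon$ is an assumption, chosen to match the random variable discussed under the assumptions on the residuals.

```latex
\begin{align*}
\text{deterministic model:} \quad & Y = wX + b \\
\text{probabilistic model:} \quad & Y = wX + b + \varepsilon
\end{align*}
```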


residuals

least squares regression: minimize the sum of squared residuals


find the minimum

normal equations (a linear system in the coefficients)
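For simple regression the normal equations have a well-known closed-form solution. The sketch below assumes the model $y \approx wx + b$; the function name `fit_simple` and the toy data are illustrative and do not come from the slides.

```python
def fit_simple(x, y):
    """Least-squares slope w and intercept b for y ~ w*x + b."""
    m = len(x)
    x_bar = sum(x) / m
    y_bar = sum(y) / m
    # Solving the two normal equations gives w = S_xy / S_xx, b = y_bar - w * x_bar.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    w = s_xy / s_xx
    b = y_bar - w * x_bar
    return w, b


# Toy data (hypothetical), roughly following y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

w, b = fit_simple(x, y)
residuals = [yi - (w * xi + b) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)  # the quantity least squares minimizes
print(f"w = {w:.3f}, b = {b:.3f}, SSE = {sse:.4f}")
```

When an intercept is fitted, the residuals sum to zero, which is a quick sanity check on any implementation.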


prediction

alternatively, force the regression line to pass through the origin (no intercept)
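A minimal sketch of both ideas, assuming the model $y \approx wx$ for the constrained fit. The helper names `fit_through_origin` and `predict` are hypothetical.

```python
def fit_through_origin(x, y):
    """Least-squares slope for y ~ w*x, with the line forced through (0, 0)."""
    # Minimizing sum((y_i - w*x_i)^2) over w gives w = sum(x*y) / sum(x^2).
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)


def predict(x_new, w, b=0.0):
    """Prediction of the fitted line at a new point."""
    return w * x_new + b


# Toy data (hypothetical) lying exactly on y = 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

w0 = fit_through_origin(x, y)
print(predict(5.0, w0))
```

Dropping the intercept is only sensible when there is a substantive reason to believe the response is zero when the predictor is zero.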


Multiple linear regression


probabilistic model

extend matrix X with a column vector whose components are all equal to 1, so that the intercept b becomes one of the coefficients


sum of squared residuals

set the partial derivatives to zero

normal equation

minimum point
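The steps above can be sketched with NumPy. This assumes the extended matrix (called `X1` here, a hypothetical name) has the column of ones appended last, and uses synthetic data; the normal equations are $(X^\top X)\,w = X^\top y$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3                               # m observations, n predictors
X = rng.normal(size=(m, n))
true_w = np.array([1.5, -2.0, 0.5])        # synthetic ground truth
y = X @ true_w + 4.0 + rng.normal(scale=0.1, size=m)

# Extend X by a column of ones so the last coefficient plays the role of b.
X1 = np.column_stack([X, np.ones(m)])

# Normal equations: (X1^T X1) w = X1^T y.
w_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)

# Cross-check with a dedicated least-squares solver, which is
# numerically safer when X1^T X1 is badly conditioned.
w_lstsq, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(w_hat)
```

At the minimum point the residual vector is orthogonal to every column of `X1`, which is exactly what the normal equations state.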


values predicted by model

hat matrix

residuals
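A small sketch of the hat matrix, assuming the usual definition $H = X (X^\top X)^{-1} X^\top$ so that $\hat{y} = Hy$ and the residuals are $(I - H)\,y$. The data are synthetic and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 20, 2
# Design matrix already extended with a column of ones for the intercept.
X1 = np.column_stack([rng.normal(size=(m, n)), np.ones(m)])
y = rng.normal(size=m)

H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T   # hat matrix, m x m
y_hat = H @ y                              # values predicted by the model
residuals = y - y_hat                      # equivalently (I - H) @ y
leverage = np.diag(H)                      # influence of each observation on its own fit

print(np.trace(H))                         # equals the number of fitted coefficients
```

`H` is symmetric and idempotent (it is a projection), and its diagonal entries are the leverages used by diagnostics such as Cook's distance.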


Assumptions on the residuals

the random variable $\varepsilon$ should follow a normal distribution with mean 0 and constant variance $\sigma^2$

the residuals $\varepsilon_i$ and $\varepsilon_k$ should be independent for $i \neq k$

estimate of $\sigma$

if the standard deviation $\sigma$ is constant, the model shows homoscedasticity; otherwise, heteroscedasticity
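The estimator on the slide was lost in extraction. The standard unbiased estimate, assuming $m$ observations, $n$ predictors plus an intercept, and residuals $e_i = y_i - \hat{y}_i$, is sketched here; the slide's own notation may differ.

```latex
\[
\hat{\sigma}^2 = \frac{SSE}{m - n - 1}
              = \frac{1}{m - n - 1} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 ,
\qquad
\hat{\sigma} = \sqrt{\hat{\sigma}^2}
\]
```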


Heteroscedasticity


Categorical variables

a categorical variable $X_j$ assumes values in a set of $H$ distinct values

it is represented by $H-1$ dummy (binary) variables

example: 11 dummy variables to represent the month associated with an observation
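A minimal sketch of the encoding for the month example. Taking the first level as the reference category (all dummies equal to 0) is an assumption; the names `MONTHS` and `month_dummies` are illustrative.

```python
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def month_dummies(month, levels=MONTHS):
    """Encode one categorical value as H-1 binary indicators.

    The first level is the reference category and maps to all zeros,
    so H = 12 months need only 11 dummy variables.
    """
    if month not in levels:
        raise ValueError(f"unknown level: {month!r}")
    return [1 if month == level else 0 for level in levels[1:]]


print(month_dummies("jan"))  # reference category
print(month_dummies("mar"))
```

Using $H$ dummies instead of $H-1$ would make the columns sum to the column of ones used for the intercept, which makes the design matrix singular.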


Ridge regression

the estimation of the matrix $(X^\top X)^{-1}$ may be critical (insufficient number of observations, multicollinearity): the problem is ill-posed

limit the width of the hypothesis space F (regularization theory)
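A sketch of the usual ridge estimator, which adds a penalty on the size of the coefficients: it minimizes $\lVert y - Xw \rVert^2 + \lambda \lVert w \rVert^2$, giving $w = (X^\top X + \lambda I)^{-1} X^\top y$. The parameter name `lam`, the synthetic data, and the omission of an intercept (data assumed centered) are assumptions for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge coefficients: solve (X^T X + lam * I) w = X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)


rng = np.random.default_rng(2)
x1 = rng.normal(size=30)
# Second column is almost a copy of the first: near-perfect collinearity.
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=30)])
y = x1 + rng.normal(scale=0.1, size=30)

w_small = ridge_fit(X, y, lam=1e-8)   # almost unregularized, unstable
w_ridge = ridge_fit(X, y, lam=1.0)    # regularized, coefficients shrink

print(np.linalg.norm(w_small), np.linalg.norm(w_ridge))
```

Adding `lam * I` makes the matrix well conditioned even when `X.T @ X` is nearly singular, at the price of some bias in the coefficients.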


Generalized linear models

Functions $g_h$ represent any set of bases, such as polynomials, kernels, and other groups of nonlinear functions.

Coefficients $w_h$ and $b$ can be determined by minimizing the sum of squared errors (SSE). In this formulation the SSE function is more complex than for linear regression, and the minimization problem is more difficult to solve.
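As one concrete case, assume the model has the form $Y = \sum_h w_h\, g_h(X) + b$ with fixed polynomial bases $g_h(x) = x^h$. With the bases fixed in advance, the fit reduces to ordinary least squares on the transformed columns; bases with their own tunable parameters would need a more general optimizer. The helper `poly_features` is hypothetical.

```python
import numpy as np


def poly_features(x, degree):
    """Columns g_h(x) = x**h for h = 1..degree, plus a constant column for b."""
    cols = [x ** h for h in range(1, degree + 1)]
    return np.column_stack(cols + [np.ones_like(x)])


x = np.linspace(-2.0, 2.0, 25)
y = 1.0 - 3.0 * x + 0.5 * x ** 2           # noiseless quadratic, for illustration

G = poly_features(x, degree=2)
coef, *_ = np.linalg.lstsq(G, y, rcond=None)   # ordered as [w_1, w_2, b]
print(coef)
```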


Normality and independence of the residuals

goodness-of-fit test (chi-squared, Kolmogorov-Smirnov)

graphical analysis:
scatterplot of residuals vs. fitted values
scatterplot of standardized residuals vs. fitted values
Q-Q plot of the residuals
Cook's distance, which identifies anomalies and large residuals for values greater than 1
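A sketch of Cook's distance using the common textbook formula $D_i = \frac{e_i^2}{p\,\hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$, where $p$ is the number of fitted coefficients and $h_{ii}$ the leverage. The function name and the toy data with one injected anomaly are assumptions for illustration.

```python
import numpy as np


def cooks_distance(X1, y):
    """Cook's distance of every observation for the least-squares fit of y on X1."""
    m, p = X1.shape                          # p counts the intercept column too
    H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
    h = np.diag(H)                           # leverages
    e = y - H @ y                            # residuals
    s2 = e @ e / (m - p)                     # estimate of the residual variance
    return (e ** 2 / (p * s2)) * h / (1.0 - h) ** 2


x = np.arange(1.0, 11.0)
noise = np.array([0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05, 0.1, -0.1])
y = 2.0 * x + 1.0 + noise
y[-1] += 15.0                                # inject an anomaly at the last point

X1 = np.column_stack([x, np.ones_like(x)])
d = cooks_distance(X1, y)
print(int(np.argmax(d)), d.max())
```

The injected point combines a large residual with high leverage (it sits at the edge of the x range), which is the situation the threshold of 1 is meant to flag.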

Scatterplot residuals vs. fitted

[Figure: "Residuals vs Fitted" plot. x-axis: Fitted values; y-axis: Residuals.]

Scatterplot standardized residuals vs. fitted

[Figure: "Scale-Location plot". x-axis: Fitted values; y-axis: Standardized residuals.]

qq-plot of the residuals

[Figure: "Normal Q-Q plot". x-axis: Theoretical Quantiles; y-axis: Standardized residuals.]

Cook's distance

[Figure: "Cook's distance plot". x-axis: Obs. number; y-axis: Cook's distance.]

Significance of the coefficients

the regression coefficients $\hat{w}$ represent an estimate of $w$

covariance matrix of the estimator

confidence intervals

for simple regression
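A sketch of the standard result $\operatorname{Cov}(\hat{w}) = \sigma^2 (X^\top X)^{-1}$, with $\sigma^2$ replaced by its estimate. The function name and synthetic data are illustrative; a confidence interval is then $\hat{w}_j \pm t \cdot se(\hat{w}_j)$ with a Student t quantile on $m-n-1$ degrees of freedom, which is left out here to avoid an extra dependency.

```python
import numpy as np


def coef_standard_errors(X1, y):
    """Least-squares coefficients, their estimated covariance, and standard errors."""
    m, p = X1.shape                            # p = n + 1 with the intercept column
    XtX_inv = np.linalg.inv(X1.T @ X1)
    w_hat = XtX_inv @ X1.T @ y
    resid = y - X1 @ w_hat
    sigma2_hat = resid @ resid / (m - p)       # residual variance estimate
    cov = sigma2_hat * XtX_inv                 # covariance matrix of the estimator
    se = np.sqrt(np.diag(cov))
    return w_hat, cov, se


rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=40)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=40)
X1 = np.column_stack([x, np.ones_like(x)])

w_hat, cov, se = coef_standard_errors(X1, y)
t_stats = w_hat / se                           # used for significance tests
print(w_hat, se)
```

For simple regression the slope's standard error reduces to $\hat{\sigma} / \sqrt{\sum_i (x_i - \bar{x})^2}$, which the matrix formula reproduces.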


hypothesis test
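The test itself was lost in extraction. The usual significance test for a single coefficient is sketched here, assuming normally distributed residuals, $m$ observations, and $n$ predictors plus an intercept; $se(\hat{w}_j)$ denotes the standard error of the estimate.

```latex
\[
H_0 : w_j = 0 \quad \text{vs.} \quad H_1 : w_j \neq 0,
\qquad
t_j = \frac{\hat{w}_j}{se(\hat{w}_j)} \sim t_{m-n-1} \ \text{under } H_0
\]
\[
\text{reject } H_0 \text{ at level } \alpha \text{ if } \; |t_j| > t_{\alpha/2,\; m-n-1}
\]
```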


Analysis of variance

Source of variation	df	Sum of squares
Regression	$n$	$SSR = \sum_{i=1}^{m} (\hat{y}_i - \bar{y})^2$
Residual	$m-n-1$	$SSE = \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$
Total	$m-1$	$SST = \sum_{i=1}^{m} (y_i - \bar{y})^2$

The aim of regression models is to explain, through the predictive variables, most of the variance inherent in the dependent variable, leaving aside only the purely random fluctuation captured by the residuals.

If this goal is achieved, one expects the sample variance of the residuals to be significantly smaller than the sample variance of the response variable.

If the residuals are normally distributed, the following ratio follows an F distribution with $n$ and $m-n-1$ degrees of freedom.
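The ratio was lost in extraction. Given the stated degrees of freedom, a sketch assuming the standard form $F = \frac{SSR/n}{SSE/(m-n-1)}$ is shown here, where SSR is the regression sum of squares and SSE the residual sum of squares. The function name and toy data are illustrative.

```python
def anova_f(y, y_hat, n):
    """Return (SSR, SSE, F) with F = (SSR / n) / (SSE / (m - n - 1))."""
    m = len(y)
    y_bar = sum(y) / m
    ssr = sum((f - y_bar) ** 2 for f in y_hat)           # explained by the model
    sse = sum((a - f) ** 2 for a, f in zip(y, y_hat))    # left in the residuals
    f_ratio = (ssr / n) / (sse / (m - n - 1))
    return ssr, sse, f_ratio


# Toy data (hypothetical); 1.99 and 0.05 are the least-squares slope and
# intercept for exactly these five points.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat = [1.99 * xi + 0.05 for xi in x]

ssr, sse, f_ratio = anova_f(y, y_hat, n=1)
print(ssr, sse, f_ratio)
```

A large ratio means the variance explained per predictor is much greater than the residual variance, which is evidence against the hypothesis that all coefficients are zero.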


Determination coefficient


determination coefficient

in the example

adjusted coefficient

in the example
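The formulas and example values on this slide were lost in extraction. A sketch using the standard definitions $R^2 = 1 - SSE/SST$ and $R^2_{adj} = 1 - (1 - R^2)\frac{m-1}{m-n-1}$ follows; the function names and the toy data are illustrative and are not the slide's example.

```python
def r_squared(y, y_hat):
    """Coefficient of determination: share of the variance of y explained by the fit."""
    y_bar = sum(y) / len(y)
    sse = sum((a - f) ** 2 for a, f in zip(y, y_hat))
    sst = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - sse / sst


def adjusted_r_squared(r2, m, n):
    """Penalize R^2 for the number of predictors n, given m observations."""
    return 1.0 - (1.0 - r2) * (m - 1) / (m - n - 1)


# Toy data (hypothetical) with its least-squares line y_hat = 1.99 x + 0.05.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat = [1.99 * xi + 0.05 for xi in x]

r2 = r_squared(y, y_hat)
r2_adj = adjusted_r_squared(r2, m=len(y), n=1)
print(r2, r2_adj)
```

Plain $R^2$ never decreases when a predictor is added, so the adjusted version is the safer one for comparing models with different numbers of predictors.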


Determination and linear correlation coefficient


Linear correlation coefficient

if $r > 0$, then X and Y are concordant; if $r < 0$, then X and Y are discordant; finally, if $r$ is close to 0, there is no linear relationship between X and Y.
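A sketch of the Pearson linear correlation coefficient, assuming the usual definition $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$. The last example shows why $r \approx 0$ rules out only a linear relationship. For simple linear regression, $R^2$ equals $r^2$.

```python
import math


def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length samples."""
    m = len(x)
    x_bar = sum(x) / m
    y_bar = sum(y) / m
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
concordant = pearson_r(x, [2 * v + 1 for v in x])   # increasing line
discordant = pearson_r(x, [-3 * v for v in x])      # decreasing line
no_linear = pearson_r(x, [v * v for v in x])        # y = x^2: dependent, but not linearly

print(concordant, discordant, no_linear)
```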


Multi-collinearity of the independent variables


variance inflation factor: values greater than 5 point to the existence of multi-collinearity
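A sketch of the variance inflation factor, assuming the standard definition $VIF_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ is the coefficient of determination obtained by regressing predictor $X_j$ on all the other predictors. The function name `vif` and the synthetic data are illustrative.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of X (predictors only, no intercept column)."""
    m, n = X.shape
    out = np.empty(n)
    for j in range(n):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(m)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / ((target - target.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out


rng = np.random.default_rng(4)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)      # nearly a copy of a
X = np.column_stack([a, b, c])

v = vif(X)
print(v)
```

In this synthetic case the first and third columns get a large VIF because each is almost determined by the other, while the independent column stays close to 1.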


Confidence and prediction limits

prediction associated with a new observation

its variance is

for simple regression the confidence limit for E[Y] is

whereas the prediction interval for Y is
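The formulas on this slide were lost in extraction. The standard textbook expressions are sketched here, assuming an intercept is fitted, $\mathbf{x}_0$ is the new observation extended with a 1, $\hat{\sigma}$ is the estimated residual standard deviation, and $t_{\alpha/2,\,m-2}$ is a Student t quantile; the slide's own notation may differ.

```latex
\begin{align*}
\hat{y}_0 &= \hat{w}^\top \mathbf{x}_0, \qquad
\operatorname{Var}(\hat{y}_0) = \sigma^2\, \mathbf{x}_0^\top (X^\top X)^{-1} \mathbf{x}_0 \\[4pt]
\text{confidence limits for } E[Y \mid x_0]: \quad
& \hat{y}_0 \pm t_{\alpha/2,\,m-2}\; \hat{\sigma}
  \sqrt{\frac{1}{m} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{m} (x_i - \bar{x})^2}} \\[4pt]
\text{prediction limits for } Y: \quad
& \hat{y}_0 \pm t_{\alpha/2,\,m-2}\; \hat{\sigma}
  \sqrt{1 + \frac{1}{m} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{m} (x_i - \bar{x})^2}}
\end{align*}
```

The prediction interval is always wider than the confidence interval because it also accounts for the random fluctuation of the new observation itself (the extra 1 under the square root).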


Example: mtcars

[Figure: scatterplot matrix of the mtcars variables mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb]