Title | BI-lect 9 - Lecture notes 9 |
---|---|
Author | Rameez Riaz |
Course | Business intelligence e data mining |
Institution | Politecnico di Milano |
Pages | 21 |
Carlo Vercellis
Old "Business Intelligence" Course Slides...
Regression
Carlo Vercellis, Politecnico di Milano, School of Management, Campus Bovisa Sud
www.door.polimi.it, via Lambruschini 4b, 20156 Milano
Regression models
A dataset $D$ contains $m$ observations and $n + 1$ attributes: $n$ independent attributes (explanatory variables, predictors) and 1 dependent attribute (target, response).
Observations are points in an $n$-dimensional space, and the target attribute is denoted as $Y$.
$X$ is the $m \times n$ matrix of data, and $y$ is the target vector.
$X_1, \dots, X_n$ and $Y$ are random variables.
Business Intelligence © Carlo Vercellis
Watch for spurious correlation: a good fit does not imply a causal relationship.
The hypothesis space should be simple, for example:
linear
quadratic
exponential
Accuracy vs. generalization
[Figure: diagram relating the Model to Past data and New data]
Simple linear regression
deterministic model: $Y = wX + b$
probabilistic model: $Y = wX + b + \varepsilon$, where $\varepsilon$ is a random error term
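residuals: $e_i = y_i - (w x_i + b)$, for $i = 1, \dots, m$
least squares regression: choose $w$ and $b$ to minimize the sum of squared residuals, $SSE = \sum_{i=1}^{m} e_i^2$
find the minimum: set the partial derivatives of $SSE$ with respect to $w$ and $b$ to zero
normal equations: the resulting linear system in the coefficients $w$ and $b$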
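As a minimal sketch of the least squares step, the function below solves the two normal equations of simple linear regression in closed form. The function names and the sample data are illustrative assumptions, not part of the slides.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (w, b) minimizing sum((y - (w * x + b)) ** 2)."""
    m = len(xs)
    if m != len(ys) or m < 2:
        raise ValueError("need at least two paired observations")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Normal equations (partial derivatives of SSE set to zero):
    #   w * sum_xx + b * sum_x = sum_xy
    #   w * sum_x  + b * m     = sum_y
    det = m * sum_xx - sum_x ** 2
    if det == 0:
        raise ValueError("all x values are identical")
    w = (m * sum_xy - sum_x * sum_y) / det
    b = (sum_y * sum_xx - sum_x * sum_xy) / det
    return w, b


def sum_squared_residuals(xs, ys, w, b):
    """Sum of squared residuals of the line y = w * x + b."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))


# Illustrative data, roughly on the line y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
w, b = fit_simple_linear_regression(xs, ys)
best_sse = sum_squared_residuals(xs, ys, w, b)
```

Any other choice of slope or intercept gives a larger sum of squared residuals on the same data, which is a quick way to sanity-check the fit.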
prediction: for a new value $x$, the model predicts $\hat{y} = \hat{w} x + \hat{b}$
alternatively, force the regression line to pass through the origin ($b = 0$)
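Multiple linear regression
probabilistic model: $Y = w_1 X_1 + \dots + w_n X_n + b + \varepsilon$
extend matrix $X$ by a column vector with all components equal to 1, so that the intercept $b$ becomes one more component of $w$ and the model reads $y = Xw + \varepsilon$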
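sum of squared residuals: $SSE = (y - Xw)^\top (y - Xw)$, with $X$ extended by the column of ones
null partial derivatives: $\partial SSE / \partial w = -2 X^\top (y - Xw) = 0$
normal equation: $X^\top X w = X^\top y$
minimum point: $\hat{w} = (X^\top X)^{-1} X^\top y$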
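The normal equation can be sketched with NumPy as follows. This is a minimal illustration that assumes $X^\top X$ is invertible; the function name and the toy data (generated exactly from $y = 2x_1 - x_2 + 3$) are assumptions made for the example.

```python
import numpy as np


def fit_multiple_linear_regression(X, y):
    """Return (w, b) solving the normal equation (X^T X) w = X^T y.

    X is extended with a column of ones so that the intercept b
    is estimated together with the other coefficients.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X_ext = np.column_stack([X, np.ones(X.shape[0])])
    # Solving the linear system avoids forming an explicit inverse.
    coeffs = np.linalg.solve(X_ext.T @ X_ext, X_ext.T @ y)
    return coeffs[:-1], coeffs[-1]


# Toy data built from y = 2 * x1 - x2 + 3 (no noise)
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [3, 6, 5, 8, 8]
w, b = fit_multiple_linear_regression(X, y)
```

With noise-free data the fit recovers the generating coefficients up to rounding error.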
values predicted by model: $\hat{y} = X\hat{w} = X (X^\top X)^{-1} X^\top y = Hy$
hat matrix: $H = X (X^\top X)^{-1} X^\top$
residuals: $e = y - \hat{y} = (I - H) y$
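A short numerical sketch of the hat matrix, using made-up data: $H$ maps the observed targets to the fitted values, is symmetric and idempotent, and leaves residuals that are orthogonal to the columns of $X$. The data and variable names are illustrative assumptions.

```python
import numpy as np

# Design matrix already extended with a column of ones (one predictor).
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0],
              [4.0, 1.0]])
y = np.array([1.0, 3.0, 2.0, 5.0])

# Hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.inv(X.T @ X) @ X.T

y_hat = H @ y            # values predicted by the model
residuals = y - y_hat    # equivalently (I - H) y
```

The trace of $H$ equals the number of estimated coefficients (2 here), and its diagonal entries are the leverages used by influence diagnostics.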
Assumptions on the residuals
the random variable $\varepsilon$ should follow a normal distribution with mean 0 and constant variance $\sigma^2$
the residuals $e_i$ and $e_k$ should be independent for $i \neq k$
estimate of $\sigma^2$: $\hat{\sigma}^2 = SSE / (m - n - 1)$
if the standard deviation is constant, the model shows homoscedasticity; otherwise, heteroscedasticity
Heteroscedasticity
Categorical variables
a categorical variable $X_j$ assumes values in the set $\{v_1, \dots, v_H\}$
it is encoded with $H - 1$ binary dummy variables
for example, 11 dummy variables represent the month associated with an observation
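The encoding can be sketched in a few lines of Python. The helper name `dummy_encode`, the choice of the first category as the all-zeros baseline, and the month labels are assumptions made for illustration.

```python
def dummy_encode(values, categories):
    """Encode each value with len(categories) - 1 binary dummy variables.

    The first category acts as the baseline and is encoded as all zeros.
    A value not listed in categories raises KeyError.
    """
    if len(categories) < 2:
        raise ValueError("need at least two categories")
    position = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * (len(categories) - 1)
        pos = position[value]
        if pos > 0:
            row[pos - 1] = 1
        rows.append(row)
    return rows


MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
encoded = dummy_encode(["Jan", "Mar", "Dec"], MONTHS)
```

Using $H - 1$ rather than $H$ indicators avoids a column that is an exact linear combination of the others and of the constant column, which would make $X^\top X$ singular.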
Ridge regression
the estimation of matrix $(X^\top X)^{-1}$ may be critical (insufficient number of observations, multi-collinearity): the problem is ill-posed
limit the width of the hypothesis space $F$ (regularization theory)
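A minimal sketch of ridge regression, assuming the common formulation that minimizes $SSE + \lambda \|w\|^2$ and leaves the intercept unpenalized by centering the data. The function name, the parameter `lam`, and the nearly collinear toy data are assumptions for illustration.

```python
import numpy as np


def fit_ridge(X, y, lam):
    """Return (w, b) minimizing SSE + lam * ||w||^2 (intercept not penalized)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean          # centering removes the intercept from the system
    yc = y - y_mean
    n = X.shape[1]
    # Regularized normal equation: (Xc^T Xc + lam * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


# Two nearly collinear predictors
X = [[1.0, 1.001], [2.0, 1.999], [3.0, 3.001], [4.0, 3.999]]
y = [2.0, 4.0, 6.0, 8.0]
w_ols, b_ols = fit_ridge(X, y, lam=0.0)    # plain least squares
w_ridge, b_ridge = fit_ridge(X, y, lam=1.0)
```

Adding $\lambda I$ makes the system well conditioned even when $X^\top X$ is close to singular, and a larger $\lambda$ shrinks the coefficient vector, which is one way of narrowing the hypothesis space.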
Generalized linear models
model: $Y = \sum_h w_h g_h(X_1, \dots, X_n) + b + \varepsilon$
Functions $g_h$ represent any set of bases, such as polynomials, kernels, and other groups of nonlinear functions.
Coefficients $w_h$ and $b$ can be determined by minimizing the sum of squared errors. The function $SSE$ in this formulation is more complex than for linear regression, so the minimization problem is more difficult to solve.
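As a sketch of the simplest case, suppose the basis functions $g_h$ are fixed in advance (here $g_1(x) = x$ and $g_2(x) = x^2$, chosen only for illustration). The model is then still linear in $w_h$ and $b$, so ordinary least squares on the transformed columns applies. The function name and the data are assumptions.

```python
import numpy as np


def fit_basis_regression(x, y, basis_functions):
    """Fit y ~ sum_h w_h * g_h(x) + b by least squares for fixed bases g_h."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # One column per basis function, plus a column of ones for b.
    G = np.column_stack([g(x) for g in basis_functions] + [np.ones_like(x)])
    coeffs, *_ = np.linalg.lstsq(G, y, rcond=None)
    return coeffs[:-1], coeffs[-1]


basis = [lambda t: t, lambda t: t ** 2]
x = np.linspace(-2.0, 2.0, 9)
y = 1.5 * x ** 2 - 0.5 * x + 2.0      # noise-free quadratic
w, b = fit_basis_regression(x, y, basis)
```

When the basis functions themselves carry parameters to be learned, the problem is no longer linear in the unknowns and typically needs iterative optimization.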
Normality and independence of the residuals
goodness-of-fit tests (chi-squared, Kolmogorov-Smirnov)
graphical analysis:
scatterplot of residuals vs. fitted values
scatterplot of standardized residuals vs. fitted values
Q-Q plot of the residuals
Cook's distance, which identifies anomalies and large residuals for values greater than 1
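The Cook's distance diagnostic can be sketched as follows, assuming the usual definition $D_i = \frac{e_i^2}{p\,\hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$, where $p$ is the number of estimated coefficients and $h_{ii}$ are the diagonal entries of the hat matrix. The function name and the data, which include one deliberately influential point, are illustrative assumptions.

```python
import numpy as np


def cooks_distance(X, y):
    """Return Cook's distance of each observation for a least squares fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X_ext = np.column_stack([X, np.ones(X.shape[0])])
    m, p = X_ext.shape                      # p counts the intercept too
    H = X_ext @ np.linalg.inv(X_ext.T @ X_ext) @ X_ext.T
    leverage = np.diag(H)
    residuals = y - H @ y
    sigma2 = residuals @ residuals / (m - p)
    return (residuals ** 2 / (p * sigma2)) * leverage / (1.0 - leverage) ** 2


# Six points near a line plus one far-away point that breaks the trend
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [15.0]]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 5.0]
distances = cooks_distance(X, y)
flagged = np.where(distances > 1.0)[0]      # threshold of 1, as in the notes
```

On this data only the last observation exceeds the threshold: it combines high leverage with a sizeable residual.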
Scatterplot residuals vs. fitted
[Figure: Residuals vs Fitted plot; x-axis: Fitted values; y-axis: Residuals]
Scatterplot standardized residuals vs. fitted
[Figure: Scale-Location plot; x-axis: Fitted values; y-axis: Standardized residuals]
qq-plot of the residuals
[Figure: Normal Q-Q plot; x-axis: Theoretical Quantiles; y-axis: Standardized residuals]
Cook's distance
[Figure: Cook's distance plot; x-axis: Obs. number; y-axis: Cook's distance]
Significance of the coefficients
$\hat{w}$ represents an estimate of the regression coefficients $w$
covariance matrix of the estimator: $V = \hat{\sigma}^2 (X^\top X)^{-1}$
confidence intervals: $\hat{w}_j \pm t_{\alpha/2}(m - n - 1) \sqrt{v_{jj}}$, where $v_{jj}$ is the $j$-th diagonal entry of $V$
for simple regression, the interval for the slope is $\hat{w} \pm t_{\alpha/2}(m - 2)\, \hat{\sigma} / \sqrt{\sum_{i=1}^{m} (x_i - \bar{x})^2}$
hypothesis test: $H_0: w_j = 0$ against $H_1: w_j \neq 0$
Analysis of variance
regression: $n$ degrees of freedom, sum of squares $SSR = \sum_{i=1}^{m} (\hat{y}_i - \bar{y})^2$
residuals: $m - n - 1$ degrees of freedom, sum of squares $SSE = \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$
total: $m - 1$ degrees of freedom, sum of squares $SST = \sum_{i=1}^{m} (y_i - \bar{y})^2$
The aim of regression models is to explain, through the predictive variables, most of the variance inherent in the dependent variable, leaving only pure random fluctuation in the residuals.
If this goal is achieved, one expects the sample variance of the residuals to be significantly smaller than the sample variance of the response variable.
If the residuals have a normal distribution, the ratio $F = \frac{SSR / n}{SSE / (m - n - 1)}$ follows an F distribution with $n$ and $m - n - 1$ degrees of freedom.
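The variance decomposition and the F ratio can be sketched numerically. This assumes a least squares fit with an intercept, for which the regression and residual sums of squares add up to the total; the function name, the returned keys, and the data are illustrative assumptions.

```python
import numpy as np


def anova_summary(X, y):
    """Return sums of squares, degrees of freedom and the F ratio."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    m, n = X.shape
    X_ext = np.column_stack([X, np.ones(m)])
    coeffs, *_ = np.linalg.lstsq(X_ext, y, rcond=None)
    y_hat = X_ext @ coeffs
    y_bar = y.mean()
    ssr = float(np.sum((y_hat - y_bar) ** 2))   # explained by the regression
    sse = float(np.sum((y - y_hat) ** 2))       # residual
    sst = float(np.sum((y - y_bar) ** 2))       # total
    f_ratio = (ssr / n) / (sse / (m - n - 1))
    return {"ssr": ssr, "sse": sse, "sst": sst,
            "df_regression": n, "df_residual": m - n - 1, "df_total": m - 1,
            "f": f_ratio}


X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
table = anova_summary(X, y)
```

A large F ratio, compared against the F distribution with the two degrees of freedom listed in the table, indicates that the regression explains far more variance than would be expected from random fluctuation alone.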
Determination coefficient
determination coefficient: $R^2 = 1 - SSE / SST$, where $SSE$ is the residual sum of squares and $SST$ is the total sum of squares
adjusted coefficient: $R^2_{adj} = 1 - (1 - R^2) \frac{m - 1}{m - n - 1}$
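A small sketch of both coefficients, assuming the standard definitions $R^2 = 1 - SSE/SST$ and $R^2_{adj} = 1 - (1 - R^2)\frac{m - 1}{m - n - 1}$. The function name and the sample values (observed targets and fitted values from a one-predictor model) are assumptions for illustration.

```python
def determination_coefficients(y, y_hat, n_predictors):
    """Return (R^2, adjusted R^2) from observed and fitted values."""
    m = len(y)
    if m != len(y_hat):
        raise ValueError("y and y_hat must have the same length")
    if m - n_predictors - 1 <= 0:
        raise ValueError("not enough observations for the adjusted coefficient")
    y_mean = sum(y) / m
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - y_mean) ** 2 for yi in y)
    if sst == 0:
        raise ValueError("the response is constant")
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (1.0 - r2) * (m - 1) / (m - n_predictors - 1)
    return r2, r2_adj


y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat = [2.04, 4.03, 6.02, 8.01, 10.0]   # fitted by y = 1.99 x + 0.05
r2, r2_adj = determination_coefficients(y, y_hat, n_predictors=1)
```

The adjusted coefficient penalizes additional predictors, so it is never larger than $R^2$ when at least one predictor is used.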
Determination and linear correlation coefficient
for simple linear regression, the determination coefficient equals the square of the linear correlation coefficient: $R^2 = r^2$
Linear correlation coefficient
if $r > 0$, then $X$ and $Y$ are concordant; if $r < 0$, then $X$ and $Y$ are discordant; finally, if $r$ is close to 0, there is no linear relationship between $X$ and $Y$.
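The three cases can be checked with a short sketch of the Pearson coefficient. The function name and data are illustrative assumptions. The last example shows that $r = 0$ only rules out a linear relationship: the points lie exactly on a parabola.

```python
import math


def linear_correlation(xs, ys):
    """Pearson linear correlation coefficient r of two equal-length samples."""
    m = len(xs)
    if m != len(ys) or m < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
r_concordant = linear_correlation(xs, [2 * x + 1 for x in xs])
r_discordant = linear_correlation(xs, [-3 * x for x in xs])
r_parabola = linear_correlation(xs, [x ** 2 for x in xs])
```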
Multi-collinearity of the independent variables
variance inflation factor: $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ is the determination coefficient of the regression of $X_j$ on the other independent variables; values greater than 5 point to the existence of multi-collinearity
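A minimal sketch of the variance inflation factor, assuming the usual definition $VIF_j = 1 / (1 - R_j^2)$ with $R_j^2$ obtained by regressing $X_j$ on the remaining predictors plus a constant. The function name and data are illustrative assumptions; the second column is almost a copy of the first.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return VIF_j for every column j of the predictor matrix X."""
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    vifs = []
    for j in range(n):
        target = X[:, j]
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(m)])
        coeffs, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coeffs
        sse = residuals @ residuals
        sst = np.sum((target - target.mean()) ** 2)
        # 1 / (1 - R_j^2) with R_j^2 = 1 - sse / sst simplifies to sst / sse
        vifs.append(np.inf if sse == 0 else sst / sse)
    return np.array(vifs)


x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = x1 + np.array([0.1, -0.1, 0.05, -0.05, 0.1, -0.1])   # nearly equal to x1
x3 = np.array([5.0, 3.0, 6.0, 2.0, 7.0, 4.0])
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
collinear = vifs > 5          # threshold of 5, as in the notes
```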
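Confidence and prediction limits
prediction associated with a new observation $x$: $\hat{y} = \hat{w}^\top x$
its variance is $\mathrm{Var}(\hat{y}) = \sigma^2 x^\top (X^\top X)^{-1} x$
for simple regression, the confidence limit for $E[Y]$ is $\hat{y} \pm t_{\alpha/2}(m - 2)\, \hat{\sigma} \sqrt{\frac{1}{m} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{m} (x_i - \bar{x})^2}}$
whereas the prediction interval for $Y$ is $\hat{y} \pm t_{\alpha/2}(m - 2)\, \hat{\sigma} \sqrt{1 + \frac{1}{m} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{m} (x_i - \bar{x})^2}}$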
Example: mtcars
[Figure: scatterplot matrix of the mtcars variables mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb]