3rd part Statistics - 30001 Bemacc class 12

Author: Francesca Ernani
Course: Statistica / Statistics
Institution: Università Commerciale Luigi Bocconi


#11 NB: a common threshold for the normal approximation is n = 30. Once the sample is drawn and the characteristic is measured on each unit, the collection of observed values is called the sample realization and is denoted by (x1, …, xn). Let X be the random variable describing the experiment consisting of drawing a unit from a population and measuring a characteristic on it. The population can be seen as the set of all possible values of X along with the probabilities of observing them; it can therefore be given a probabilistic definition as the probability distribution of X. The random sample is an i.i.d. sample from the distribution of X. Three inferential problems can be addressed:
● Point estimation problem: we look for a good guess (approximation) of an unknown parameter.
● Interval estimation problem: we look for an interval of values containing an unknown parameter with a predetermined level of confidence.
● Hypothesis testing problem: we need to determine whether there is enough empirical evidence to reject a given statement (hypothesis) on a parameter.

Any inference is based on the information provided by the sample. The sample information is summarized by functions of the sample called sample statistics. Sample statistics, being functions of random variables, are themselves random variables. In particular, inferences on the population mean are based on two statistics: the sample mean and the sample variance.

In an estimation problem we want to approximate, or estimate, an unknown parameter. The sample statistic used for the estimation of a parameter is called the parameter estimator; its observed value is the parameter estimate. The SAMPLE MEAN is an unbiased estimator of the population mean. If we use the sample mean as an estimator of the unknown population mean, we might over-estimate or under-estimate it, but the larger the sample size the closer we get to the unknown parameter: the sample mean is a consistent estimator of the population mean. The standard deviation of the sample mean, σ/sqrt(n), is called the standard error of the mean and can be seen as an overall measure of the error we make by estimating the unknown population mean with the sample mean. It can be shown that the expected value of the SAMPLE VARIANCE s² is the population variance σ², so that the sample variance is an unbiased estimator of the population variance.

Estimation of the standard error of the mean: the standard error of the mean, σ/sqrt(n), depends on the unknown population standard deviation σ. We can estimate it by replacing σ with the sample standard deviation s, so the standard error of the mean is estimated by s/sqrt(n).

Any inference on the unknown population proportion is based on the SAMPLE PROPORTION, denoted by p̂. It is the proportion or percentage of times the outcome of interest is observed in the sample. The sample proportion is the sample mean of a sample of variables (X1, …, Xn) i.i.d. with Bernoulli distribution of parameter p.
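As an illustration (the sample values and variable names below are hypothetical), these estimators can be computed in R:

x <- c(4.2, 5.1, 3.8, 4.9, 5.5, 4.4, 5.0, 4.7)   # hypothetical sample realization
n <- length(x)
xbar <- mean(x)                # sample mean: unbiased and consistent estimator of µ
s2 <- var(x)                   # sample variance: divides by n-1, unbiased for σ²
se <- sd(x) / sqrt(n)          # estimated standard error of the mean, s/sqrt(n)

outcomes <- c(1, 0, 1, 1, 0, 1, 0, 1)  # 1 = outcome of interest observed
phat <- mean(outcomes)         # sample proportion: a sample mean of Bernoulli variables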

#12 In addressing an estimation problem, we cannot stop at the point estimation of a parameter. We also need to evaluate the accuracy of the estimation, i.e., to complete the analysis by providing a measure of the error we make in estimating the unknown parameter by means of a given sample statistic. For an unbiased estimator, the standard error measures the spread of the estimates around the parameter and therefore provides a first evaluation of the accuracy of the estimation procedure: the lower the standard error, the higher the accuracy of the estimator. The standard error of the mean σ/sqrt(n) is estimated by S/sqrt(n), where S is the sample standard deviation. The standard error of the mean provides a first way of assessing the accuracy of the sample mean as an estimator of the population mean. A confidence interval estimation of an unknown parameter provides an approximation of the parameter and, at the same time, a probabilistic assessment of the approximation error that we make. Let us begin by considering the (theoretical) case of a sample of i.i.d. random variables having a normal distribution with unknown mean µ and known variance σ². We know that the unknown population mean can be estimated by the sample mean X̄, and that the sample mean in this case has a normal distribution with mean µ and known variance σ²/n. Standardization of the sample mean: the following random variable has a standard normal distribution:

Z = (X̄ − µ) / (σ/sqrt(n))

The resulting (1−α)·100% confidence interval for µ is X̄ ± z(α/2)·σ/sqrt(n), where z(α/2) is the quantile of the standard normal distribution leaving probability α/2 in the upper tail.

The accuracy of the estimation through confidence intervals is measured by the length of the interval. The longer the interval the lower the accuracy of the estimation.
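For instance (the data are hypothetical and σ is assumed known), such an interval can be computed in R:

x <- c(4.2, 5.1, 3.8, 4.9, 5.5, 4.4, 5.0, 4.7)    # hypothetical sample
sigma <- 2                                         # known population standard deviation (assumed)
alpha <- 0.05
z <- qnorm(1 - alpha/2)                            # standard normal quantile z(α/2)
mean(x) + c(-1, 1) * z * sigma / sqrt(length(x))   # 95% confidence interval for µ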

Let us now consider the more realistic case of a sample of i.i.d. random variables having a normal distribution with unknown mean µ and unknown variance σ². The following random variable has a distribution called Student's t distribution with (n−1) degrees of freedom:

T = (X̄ − µ) / (S/sqrt(n))

where S is the sample standard deviation.

Student's t distribution is similar to a standard normal distribution but has heavier tails. As the number of degrees of freedom increases, the distribution becomes more and more similar to a standard normal distribution.
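This convergence can be checked numerically in R by comparing t quantiles with the corresponding standard normal quantile:

qnorm(0.975)          # ≈ 1.960
qt(0.975, df = 5)     # ≈ 2.571: heavier tails give larger quantiles
qt(0.975, df = 30)    # ≈ 2.042
qt(0.975, df = 1000)  # ≈ 1.962: almost the normal quantile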

#13 Let us consider the case in which we want to make inferences on the population mean of a characteristic of interest, but we cannot assume that, at the level of the population, the behaviour of such characteristic is described by a normal distribution (unknown population). For large samples, by the Central Limit Theorem, the sample mean still has an approximate normal distribution, so confidence intervals are built as in the normal case, using the quantiles of the standard normal distribution and the estimated standard error S/sqrt(n).

Confidence intervals with R: t.test(variable_name). By default, 95% confidence intervals are worked out. We can set any (1−α)·100% level through the option conf.level, as follows: t.test(variable_name, conf.level = 1−α). Confidence interval on the proportion: any inference on the unknown population proportion p is based on the sample proportion, denoted by p̂. It is the proportion or percentage of times the outcome of interest is observed in the sample. The sample proportion is the sample mean of the sample of variables (X1, …, Xn) i.i.d. with Bernoulli distribution of parameter p.
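For example (the data are hypothetical; prop.test is base R's large-sample confidence interval for a proportion):

x <- c(4.2, 5.1, 3.8, 4.9, 5.5, 4.4, 5.0, 4.7)
t.test(x)                      # 95% confidence interval for the mean
t.test(x, conf.level = 0.99)   # 99% confidence interval
prop.test(40, 100)             # 95% interval for p, with 40 successes out of 100 trials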

#15 A hypothesis is a statement on a population parameter that specifies a value or an entire range of values. A hypothesis is said to be simple if it specifies a single value of the parameter, and composite if it specifies an entire range of values. A hypothesis testing problem is described by formulating two hypotheses: a null hypothesis and an alternative hypothesis. The two hypotheses are formulated so that only one of the two is true; we will never know which one. A hypothesis testing problem is a decision problem: we have to decide whether to reject the null hypothesis or not to reject it, based on the empirical evidence provided by the sample. When we reject H0, we decide to act as if H0 were false and H1 true. We presume that the null hypothesis is true unless the data provide enough evidence against it. The alternative hypothesis is the analyst's research hypothesis.

The best decision rule would be the one that minimizes both the probability of making a Type I error and the probability of making a Type II error. Unfortunately, a rule of this kind cannot be found: it can be shown that a decrease in the probability of making one type of error is associated with an increase in the probability of making an error of the other type. There is a trade-off between the probabilities of the two errors. The worst error is the Type I error. Remember that, in the choice of the hypotheses, the null hypothesis is the one that depicts the current situation, so that once it is rejected a change is needed and costs must be incurred. In this sense we understand why the Type I error is the worst one: when making a Type I error, we incur costs that are not necessary; when making a Type II error, no action is taken, no change in the current situation is made, and no costs are incurred.

The decision rule we use is specified in such a way as to control the probability of making a Type I error. The probability of a Type I error is called the significance level of the test and is denoted by α. A significance test of level α is a testing procedure such that, when applied, the probability of making a Type I error equals α.

In summary: we fix the significance level α (the probability of a Type I error) and choose the decision rule accordingly; the probability of a Type II error is then determined by the rule and is not directly controlled.

#16 A hypothesis testing problem can also be solved based on a different decision rule, specified in terms of the p-value. We call the function of the sample that we use for rejecting the null hypothesis the test statistic. We call the p-value of a test the probability of observing a value of the test statistic more extreme than the observed one, assuming that the null hypothesis is true. "More extreme" is meant in the direction specified by the alternative hypothesis.

For a fixed significance level α, we reject the null hypothesis if the p-value is smaller than α.
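A minimal sketch of this rule in R (the data and µ0 = 4 are hypothetical):

alpha <- 0.05
test <- t.test(c(4.2, 5.1, 3.8, 4.9, 5.5), mu = 4)   # test of H0: µ = 4
if (test$p.value < alpha) "reject H0" else "do not reject H0"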

P-value interpretation:
● Assuming that the unknown population mean is equal to µ0, the p-value returns the probability of observing a value of the sample mean falling above µ0 by a larger number of standard errors than the observed one.
● Assuming that the unknown population mean is equal to µ0, the p-value returns the probability of observing a value of the sample mean that is less plausible than the one observed.
● Assuming that the unknown population mean is equal to µ0, the p-value returns the probability of observing a value of the sample mean that provides even stronger evidence against the null hypothesis than the observed one.

Test on the mean of a normal population with unknown variance: the tests of significance level α on the mean of a normal population with unknown variance have the same mathematical structure as the tests in the case of known variance. The differences are that the standard error is estimated and that quantiles of a Student's t distribution are used in place of the quantiles of a standard normal distribution.

For a Student's t random variable with n−1 degrees of freedom, the probability of observing a value smaller than or equal to a given x is obtained through the command pt(x, n-1).
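As a sketch, such a test can be carried out by hand with pt (the data and µ0 = 4.5 are hypothetical):

x <- c(4.2, 5.1, 3.8, 4.9, 5.5, 4.4, 5.0, 4.7)
n <- length(x)
tstat <- (mean(x) - 4.5) / (sd(x) / sqrt(n))   # test statistic under H0: µ = 4.5
2 * pt(-abs(tstat), df = n - 1)                # two-sided p-value
# this matches t.test(x, mu = 4.5)$p.value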

Test on the mean of an unknown population: the tests of significance level α on the mean of an unknown population have the same mathematical structure of the tests in the case of a normal

population with unknown variance. The unique difference is that the quantiles of a Student’s t distribution are replaced by the quantiles of a standard normal distribution.

The p-values of the tests on the mean of an unknown population are defined as in the normal case and worked out using the standard normal distribution.
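For a large sample, such a test can be sketched in R with the standard normal distribution (the data and µ0 = 4.5 are hypothetical):

x <- rnorm(100, mean = 4.6, sd = 1)                 # placeholder for a large observed sample
z <- (mean(x) - 4.5) / (sd(x) / sqrt(length(x)))    # standardized statistic under H0: µ = 4.5
2 * pnorm(-abs(z))                                  # two-sided p-value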

Test on the population proportion:

#18 We know how to assess, at the level of the sample, whether there is association between a numerical variable and a categorical variable. The approach is asymmetric: the numerical variable is taken as the variable that we want to explain and predict based on the categorical one, drawing on their association. The categorical variable splits the sample into two sub-samples, and we compare the behaviour of the numerical variable in the two groups. The comparison can be run graphically, by plotting a boxplot of the numerical variable in the two groups, or by working out synthetic measures such as the mean and standard deviation of the numerical variable in the two groups. The question is then whether a difference observed at the sample level provides enough evidence of a difference at the population level.
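At the descriptive level, the comparison can be sketched in R as follows (y, g and their values are hypothetical):

y <- c(4.2, 5.1, 3.8, 4.9, 5.5, 4.4, 5.0, 4.7)          # numerical variable
g <- factor(c("A", "A", "A", "A", "B", "B", "B", "B"))  # categorical variable
boxplot(y ~ g)          # side-by-side boxplots of y in the two groups
tapply(y, g, mean)      # group means
tapply(y, g, sd)        # group standard deviations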

In order to answer this question, we need to address a hypothesis testing problem on a new parameter: the difference between the means of two populations.

Sample means difference, NORMAL POPULATIONS: for two normal populations with the same variance, the test of H0: µX = µY is based on the statistic

T = (X̄ − Ȳ) / sqrt(Sp²·(1/nX + 1/nY)), where Sp² = ((nX−1)·SX² + (nY−1)·SY²) / (nX + nY − 2) is the pooled sample variance.

Under the null hypothesis, T has a Student's t distribution with (nX + nY − 2) degrees of freedom.

The p-values of the significance tests on the difference between two population means are defined as in the case of the tests on the mean of one population. In the case of normal populations, they are worked out using the Student's t distribution with (nX + nY − 2) degrees of freedom.
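A hand-computation sketch in R (the two samples are hypothetical):

x <- c(4.2, 5.1, 3.8, 4.9, 5.5)
y <- c(5.0, 5.8, 5.4, 6.1, 5.6, 5.9)
nx <- length(x); ny <- length(y)
sp2 <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)  # pooled variance
tstat <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))
2 * pt(-abs(tstat), df = nx + ny - 2)                           # two-sided p-value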

#19 Test on the difference of two population means, unknown populations: we keep assuming that the two populations have the same variance. The test statistic is the same as before. It can be shown that, for large sample sizes nX and nY, the test statistic has an approximate standard normal distribution. The only difference is in the use of the quantiles of the standard normal distribution, which replace the quantiles of the Student's t distribution.

Test on the mean difference with R:
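A minimal sketch of the command (x and y are hypothetical numeric samples; var.equal = TRUE imposes the common-variance assumption):

x <- c(4.2, 5.1, 3.8, 4.9, 5.5)
y <- c(5.0, 5.8, 5.4, 6.1, 5.6, 5.9)
t.test(x, y, var.equal = TRUE)       # pooled two-sample t-test
# with a grouping factor: t.test(values ~ group, var.equal = TRUE)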

Association and independence: on a sample randomly drawn from a population, let us assume that we measure two characteristics taking non-numerical values. From the information delivered by the sample, we want to assess whether there is association between the two categorical variables, i.e., we need to decide whether there is enough empirical evidence to presume that the two variables are not independent.

So we use a Chi-square test of independence. It tests at significance level α the null hypothesis H0: the two variables, X and Y, are independent against the alternative H1: the two variables are not independent.

The higher the value of the chi-square test statistic, the stronger the evidence against the null hypothesis. It can be shown that, under the null hypothesis, the chi-square test statistic has a known distribution called the chi-square distribution with (r−1)(c−1) degrees of freedom, where r and c are the numbers of rows and columns of the contingency table.

For a chi-square distribution with k degrees of freedom, the probability of observing a value smaller than or equal to a given x is obtained through the command pchisq(x, k). The quantile of order (1−α) of a chi-square distribution with k degrees of freedom is worked out as qchisq(1-α, k).
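In practice the whole test is run with chisq.test (the contingency table below is hypothetical; correct = FALSE switches off the continuity correction applied by default to 2x2 tables):

tab <- matrix(c(30, 20, 10, 40), nrow = 2,
              dimnames = list(X = c("x1", "x2"), Y = c("y1", "y2")))
chisq.test(tab, correct = FALSE)     # chi-square test of independence
qchisq(0.95, df = 1)                 # critical value at α = 0.05 with (r-1)(c-1) = 1 df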

