ECON 1203 Notes PDF

Title ECON 1203 Notes
Course Business and Economic Statistics
Institution University of New South Wales
Pages 90
File Size 7.6 MB
File Type PDF



Description

ECON 1203 Data Types, Summaries and Visualisation

Definition (What are Data?) A datum is a single measurement of something on a scale that is understandable to both the recorder and the reader. Data are multiple such measurements.

Common issues with data:

- Missing values: how do we fill them in?
- Wrong values: how can we detect and correct them?
- Messy format and not usable (the data cannot answer the question posed)

A variable is a characteristic of a population or of a sample from a population.

- We observe values or observations of a variable
- A data set contains observations on variables

Note. The first stage in our big picture determines/defines the population characteristic of interest.

Types of variables

1. Quantitative (numerical) or qualitative (categorical: nominal, ordinal)

- Exam scores and time are quantitative variables
- Gender is a nominal qualitative variable
- Ordinal qualitative data feature a natural ordering, e.g. course evaluations (poor, good, very good) or Standard & Poor’s ratings

Note. In order to apply statistical analyses directly to qualitative data, we must convert it somehow to quantitative data.

A population is the entire set of objects or events under study. A population can be hypothetical (“all students”) or concrete (all students in this class).

A sample is a representative subset of the objects or events under study. Samples are needed because it is impossible or intractable to obtain or compute with population data.

1. Discrete data: a finite number of values are possible in any bounded interval. Takes on only integer values (whole numbers, no decimals), e.g. number of people in the room
2. Continuous data: an infinite number of values are possible in any bounded interval. Takes on any value on the real number line (all numbers, including decimals), e.g. rate of population increase, household income, COVID-19 infection

The type of observation made by the statistician can also be used to classify data:

a. Time series data consist of measurements of the same concept at different points in time, e.g. Sydney-area births per day for each day in a year, or the daily closing value of the Dow Jones Industrial Average
b. Cross-sectional data consist of measurements of one or more concepts at a single point in time. They are collected by observing many subjects (such as individuals, firms, countries or regions) at one point or period of time, e.g. age, gender and marital status of a sample of UNSW staff in a particular year
c. Panel data, sometimes referred to as longitudinal data, consist of measurements on cross sections repeated across time

Data types:

A. Nominal data. Frequencies, proportions and percentages. Visualisation methods: pie chart, bar chart
B. Ordinal data. In addition to the tools available for nominal data, you can use percentiles, median, mode and the interquartile range to summarise your data
C. Continuous data. You can summarise your data using percentiles, median, interquartile range, mean, mode, standard deviation and range. Visualisation using histograms and boxplots

Frequency Distribution: summaries of categorical data using counts.

- Categories need to be mutually exclusive and exhaustive
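As a rough illustration of the summaries listed above, the sketch below builds a frequency distribution for an ordinal variable and computes a few numerical summaries for a quantitative one. The ratings and exam scores are made-up values, not data from the course.

```python
from collections import Counter
import statistics

# Hypothetical ordinal data: course evaluations
ratings = ["good", "poor", "very good", "good", "good", "very good"]

# Frequency distribution: counts and proportions per category.
# Each observation falls in exactly one category (mutually exclusive),
# and every observation is counted (exhaustive), so proportions sum to 1.
freq = Counter(ratings)
proportions = {category: count / len(ratings) for category, count in freq.items()}

# Hypothetical quantitative data: exam scores
scores = [52, 61, 61, 70, 75, 83, 90]
mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
mode_score = statistics.mode(scores)
std_dev = statistics.stdev(scores)            # sample standard deviation
data_range = max(scores) - min(scores)
q1, q2, q3 = statistics.quantiles(scores, n=4)  # quartiles (default method)
iqr = q3 - q1

print(freq, proportions)
print(mean_score, median_score, mode_score, std_dev, data_range, iqr)
```

Note that different software packages use slightly different quartile conventions, so the interquartile range may not match Excel exactly.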

Ordinal Data. (Whether discrete or continuous)

- Obvious categories for the data values may not exist
- But we can create categories or classes by defining lower and upper class limits
- Categories need to be mutually exclusive and exhaustive

How many categories? (Excel calls them bins)

- Too many categories do not summarise the data
- Too few categories convey no information
- There are no set rules on the number of bins, although having more observations means one generally wants more bins
- Bins need not be of equal width, and may be open-ended at the top or bottom

- Number of classes: should be between 5 and 15 (<5 too much summarisation, >15 too much detail)
- Class width ≈ range / number of classes
- (20000, 40000] means that the lower end is not included in this class, but the upper end is included.
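The class-limit convention above (lower end excluded, upper end included) can be sketched as follows. The income values, the starting limit and the choice of four classes are assumptions made purely for illustration; with so few observations the 5 to 15 guideline is not meant to apply.

```python
# Hypothetical household incomes
incomes = [25000, 40000, 41000, 58000, 60000, 75000, 99000, 20001]

n_classes = 4
# Rule of thumb from the notes: class width is roughly range / number of classes
approx_width = (max(incomes) - min(incomes)) / n_classes   # about 19750

# Round to convenient limits (an assumption): start at 20000, width 20000
start, width = 20000, 20000
classes = [(start + i * width, start + (i + 1) * width) for i in range(n_classes)]

# Count observations with lower < x <= upper, so each value lands in one class only
counts = [sum(1 for x in incomes if lower < x <= upper) for lower, upper in classes]

for (lower, upper), count in zip(classes, counts):
    print(f"({lower}, {upper}]: {count}")
```

Because the lower end of the first class is excluded, the starting limit has to sit strictly below the smallest observation.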

- Our exact level of statistical confidence in this rejection is 1 minus the p-value (which in this case is .1562)
- But ethical practice is to choose a level of confidence before we test a hypothesis
- In this example, our p-value of .1562 indicates a failure to reject the null hypothesis at the 99%, 95% or 90% levels of confidence, which are the conventional levels used by practitioners
- When we choose our confidence level in advance, the alpha level associated with that confidence level is defined as the maximum p-value that would enable that level of confidence. So, for 99% confidence, α = .01. For 95%, α = .05. And so forth
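The decision rule just described (reject H0 only if the p-value does not exceed the α chosen in advance) can be sketched in a few lines. Only the p-value of .1562 comes from the notes; the variable names are illustrative.

```python
p_value = 0.1562  # p-value from the example in the notes

# Conventional confidence levels and their associated alpha levels
confidence_levels = [0.99, 0.95, 0.90]

decisions = {}
for confidence in confidence_levels:
    alpha = round(1 - confidence, 2)   # e.g. 95% confidence -> alpha = .05
    decisions[confidence] = "reject H0" if p_value <= alpha else "do not reject H0"

for confidence, decision in decisions.items():
    print(f"{confidence:.0%} confidence: {decision}")
```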

Predicting Election Results

Finite and Large Sample Distribution

Suppose we want to know the mean value of X (µX) in a population, for example:

- The mean wage of university graduates
- The mean level of education in Australia

Suppose we draw a random sample of size n with X1, ..., Xn i.i.d. Possible estimators of µX:

- The sample mean (average) X̄ = (1/n)(X1 + ... + Xn)

But it is not the only estimator:

1. The first observation X1
2. A weighted average of X1, ..., Xn

Properties of Estimators

To determine which estimator is the best estimator of µX, one considers 3 desirable properties of an estimator. Let µ̂X be an estimator of the population mean µX.

1. Unbiasedness: the mean of the sampling distribution of µ̂X equals µX, i.e. E(µ̂X) = µX
2. Consistency: the probability that µ̂X is within a very small interval of µX approaches 1 as n → ∞
3. Efficiency: if the variance of the sampling distribution of µ̂X is smaller than that of some other estimator µ̃X, then µ̂X is more efficient: Var(µ̂X) < Var(µ̃X)
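A small simulation can make unbiasedness and efficiency concrete by comparing two of the estimators above: the sample mean and the first observation X1. This is only a sketch: the normal population, its standard deviation, the sample size and the number of replications are all assumptions, with the mean set to the 60000 wage figure used in the notes' example.

```python
import random
import statistics

random.seed(0)

# Assumed population: wages ~ Normal(60000, 10000^2); sigma is hypothetical
MU, SIGMA = 60000, 10000
N, REPS = 25, 5000   # sample size and number of simulated samples (assumptions)

sample_means, first_observations = [], []
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    sample_means.append(sum(sample) / N)      # estimator 1: sample mean
    first_observations.append(sample[0])      # estimator 2: first observation

# Unbiasedness: both sampling distributions should be centred near MU
mean_of_xbar = statistics.mean(sample_means)
mean_of_x1 = statistics.mean(first_observations)

# Efficiency: the sample mean should have the smaller sampling variance
var_of_xbar = statistics.variance(sample_means)
var_of_x1 = statistics.variance(first_observations)

print(mean_of_xbar, mean_of_x1)
print(var_of_xbar, var_of_x1)
```

Both estimators come out centred close to µ, but the variance of X1 stays near σ² while the variance of the sample mean is roughly σ²/n, which is why the sample mean is preferred.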

Example: estimating the average wage

Example: Unbiasedness

Example: Mean Wages of individuals with a master degree with µW = 60000

LLN Theory

Example: Errors by Bank Tellers

Finite Sample Distribution for X̄ for n = 2

What is the sampling distribution of the sample mean for n = 2?

- We would need to take all possible samples of size 2 from the population
- For each (x1, x2) pair, we would then calculate the sample mean
- Then we could count how many of our samples generated each value of the mean, from which we can calculate the overall probability of drawing, from our population, a sample of size 2 with that mean
- This is the distribution, across all possible samples of size 2, of the new random variable called “the sample mean for a sample of size 2”
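The enumeration described above can be sketched directly. The three-value population distribution below is hypothetical (it is not the bank teller data), and the two draws are assumed to be i.i.d.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Hypothetical population distribution: value -> probability
population = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}

# Enumerate every (x1, x2) pair, compute its sample mean,
# and accumulate the probability of that pair under i.i.d. sampling
sampling_dist = Counter()
for (x1, p1), (x2, p2) in product(population.items(), repeat=2):
    xbar = Fraction(x1 + x2, 2)
    sampling_dist[xbar] += p1 * p2

for xbar in sorted(sampling_dist):
    print(f"P(sample mean = {xbar}) = {sampling_dist[xbar]}")

# The mean of the sampling distribution equals the population mean
population_mean = sum(x * p for x, p in population.items())
mean_of_sample_mean = sum(xbar * p for xbar, p in sampling_dist.items())
```

Exact fractions are used so that the probabilities add up to exactly 1 and can be compared without rounding error.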

Finite Sample Distribution for the other statistics

- The sampling distribution of the sample mean could be obtained in a similar manner for other sample sizes
- We could also produce the sampling distribution for the sample’s variance (or any other statistic)
- We have derived this sampling distribution directly, though in general, this isn’t necessary
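The same direct approach works for the sample variance. As a sketch under the same assumptions as before (a made-up three-value population and i.i.d. draws), the code below enumerates all samples of size 2 and tabulates s² for each.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Hypothetical population distribution: value -> probability
population = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}
n = 2

s2_dist = Counter()
for (x1, p1), (x2, p2) in product(population.items(), repeat=n):
    xbar = Fraction(x1 + x2, n)
    # Sample variance with the n - 1 divisor
    s2 = ((x1 - xbar) ** 2 + (x2 - xbar) ** 2) / (n - 1)
    s2_dist[s2] += p1 * p2

for s2 in sorted(s2_dist):
    print(f"P(s squared = {s2}) = {s2_dist[s2]}")

# Compare the mean of this sampling distribution with the population variance
population_mean = sum(x * p for x, p in population.items())
population_var = sum((x - population_mean) ** 2 * p for x, p in population.items())
expected_s2 = sum(s2 * p for s2, p in s2_dist.items())
```

In this small example the mean of the sampling distribution of s² equals the population variance exactly, which is what unbiasedness of s² means.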

Finite Sample Distribution for s² for n = 2

Sampling Distribution of the sample mean

- We have general expressions for the mean and variance of the sampling distribution of the sample mean
- But this may not be enough to characterise the whole sampling distribution
- Are there circumstances where the sampling distribution is known?
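The general expressions referred to above are the standard results E(X̄) = µ and Var(X̄) = σ²/n for i.i.d. sampling. The sketch below checks them by simulation for a deliberately non-normal population and also looks at how close the sampling distribution is to normal for a moderately large n. The exponential population, the sample size and the number of replications are assumptions chosen for illustration.

```python
import math
import random
import statistics

random.seed(2)

# Assumed population: Exponential with rate 1, so mu = 1 and sigma = 1 (skewed, not normal)
MU, SIGMA = 1.0, 1.0
N, REPS = 30, 5000

sample_means = []
for _ in range(REPS):
    sample = [random.expovariate(1.0) for _ in range(N)]
    sample_means.append(sum(sample) / N)

# Compare simulated moments with E(X-bar) = mu and Var(X-bar) = sigma^2 / n
simulated_mean = statistics.mean(sample_means)
simulated_var = statistics.variance(sample_means)
theoretical_var = SIGMA ** 2 / N

# If the normal approximation is reasonable, about 95% of sample means
# should lie within 1.96 standard errors of mu
standard_error = SIGMA / math.sqrt(N)
share_within = sum(abs(m - MU) <= 1.96 * standard_error for m in sample_means) / REPS

print(simulated_mean, simulated_var, theoretical_var, share_within)
```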

Sampling Distribution of the sample mean when X is normal

Approximating the Distribution of X̄

Example: CLT Application to auditing accounts

Law of Large Numbers Theorem

Hypothesis Testing: Motivation

- Make conclusions about a population parameter
1. The null hypothesis H0. The status quo (an existing condition) is defined, where the value of a parameter is known
2. The alternative/research hypothesis (the parameter has changed). Something may occur that changes the status quo in some way, so that a new value of the parameter exists
3. Hypothesis testing is a procedure used to assess whether there is sufficient evidence to conclude that the status quo has changed
- Hypothesis testing cannot provide 100% certainty in any conclusion that is reached
- Hypothesis testing can result in errors when reaching a conclusion that is based on sample data

Example: Chicken Feed

Other Examples:

Hypothesis testing: the process

How are data used to test a null hypothesis?

1. Proceed by comparing a test statistic with the value specified in H0 and decide whether the difference is:
- Small enough to be attributable to random sampling errors → do not reject H0, or
- So large that H0 is “more likely” not to be correct → reject H0
2. Formally define a rejection (or critical) region by your choice of alternative possibilities
- If values of the test statistic are “extreme enough”, i.e. if they fall in the rejection region, then they lead us to reject H0 in favour of H1
3. Other possible values of the test statistic, that are not so extreme, lie in the non-critical region

One and two-tailed tests

1. A one-tailed test defines the rejection region as one extreme end of the sampling distribution
- The null and alternative hypotheses here will look something like: H0: µ = .7, H1: µ > .7
2. A two-tailed test defines the rejection region as both extreme ends of the sampling distribution
- Here the null and alternative might look like this: H0: µ = .7, H1: µ ≠ .7
3. Also recall: the alpha level associated with a confidence level (e.g. 95%) is defined as the maximum p-value that would enable that level of confidence in our judgment (e.g. 0.05). This is the probability of our statistic falling into the rejection region when H0 is true (for a two-tailed test, split the probability across the two ends of the sampling distribution!)

Example: Skills Test
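To illustrate the one- and two-tailed rejection regions described above, the sketch below finds the standard normal critical values for α = .05 and applies both rules to a single standardised test statistic. The observed z value is made up and is not taken from the Skills Test example; the function and variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05
standard_normal = NormalDist()  # mean 0, standard deviation 1

# One-tailed (upper) test: all of alpha sits in one tail
z_one_tailed = standard_normal.inv_cdf(1 - alpha)       # about 1.645
# Two-tailed test: alpha is split across the two tails
z_two_tailed = standard_normal.inv_cdf(1 - alpha / 2)   # about 1.960


def reject_h0(z, alternative):
    """Return True if z falls in the rejection region for the given alternative."""
    if alternative == "greater":      # H1: mu > mu_0
        return z > z_one_tailed
    if alternative == "two-sided":    # H1: mu != mu_0
        return abs(z) > z_two_tailed
    raise ValueError("alternative must be 'greater' or 'two-sided'")


z_observed = 1.8  # hypothetical standardised test statistic
print(reject_h0(z_observed, "greater"))     # inside the one-tailed rejection region
print(reject_h0(z_observed, "two-sided"))   # not extreme enough for two tails
```

The same statistic can lead to rejection in a one-tailed test but not in a two-tailed test, because splitting α pushes each critical value further out.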

What if σ is unknown

- We have considered the following problems when dealing with the sample mean:
1. Working out its sampling distribution
2. Formulating and conducting hypothesis tests for µ
- But we have assumed that the population variance σ² is known
1. As with our estimation of the population proportion, this is usually unrealistic
2. We do have available, in any given sample, s²: a consistent and unbiased estimator of σ²

What if σ is unknown & n is large

1. By the CLT, the sampling distribution of the sample mean is approximately normal irrespective of the population distribution
2. When σ is unknown and is replaced by s, our standardised test statistic remains approximately (asymptotically) normally distributed
- Why? Because in large samples s will be close to σ with high probability (s converges in probability to σ)
- True for the sampling distribution of both the sample mean and the sample proportion

What if σ is unknown & n is small

- But what about small n?
1. Let us confine ourselves first to sampling from a population in which the target random variable is distributed normally
- Let our sample be i.i.d. from N(µ, σ²)
1. Recall that linear combinations of normal random variables are also normal, which is why we know the sample mean and its standardised version (using σ) will also be normally distributed
2. But what happens when we standardise using s?
- Problem: when n is small, sampling variation in s introduces additional uncertainty and distortion to the distribution of the standardised variable
- This latter variable is no longer a normal RV even if X̄ is normal
- New exact probability distribution for the standardised RV!

Gosset’s Student-t Distribution

Properties of a Student-t Distribution

1. Symmetric and unimodal
- Looks very similar to the normal, but has fatter tails
- Is totally characterised by its degrees of freedom v
- For larger and larger v, the distribution becomes more and more like a normal distribution
2. Textbook Appendix Table A-6 only provides critical values
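The fatter tails can be seen in a quick simulation: standardise the sample mean of a small normal sample once with the true σ and once with the sample s, and count how often each version exceeds the normal cut-off of 1.96. The sample size of 5, the standard normal population and the number of replications are assumptions for illustration only.

```python
import math
import random
import statistics

random.seed(1)

MU, SIGMA = 0.0, 1.0      # assumed normal population
N, REPS = 5, 20000        # small sample, so v = N - 1 = 4 degrees of freedom


def standardised_mean(sample, use_s):
    """Standardise the sample mean using s (if use_s) or the known sigma."""
    xbar = sum(sample) / len(sample)
    scale = statistics.stdev(sample) if use_s else SIGMA
    return (xbar - MU) / (scale / math.sqrt(len(sample)))


exceed_with_sigma = 0
exceed_with_s = 0
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    exceed_with_sigma += abs(standardised_mean(sample, use_s=False)) > 1.96
    exceed_with_s += abs(standardised_mean(sample, use_s=True)) > 1.96

share_with_sigma = exceed_with_sigma / REPS   # close to 0.05 (normal)
share_with_s = exceed_with_s / REPS           # noticeably larger (Student-t, fat tails)
print(share_with_sigma, share_with_s)
```

With σ known, about 5% of standardised means fall beyond ±1.96; with s in its place the share is clearly higher, which is why the t tables give larger critical values for small samples.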

Student-t Shape and Properties

Hypothesis testing, σ is unknown

Consider the null hypothesis H0: µX = µ0 and a significance level α, and let x̄ be the observed sample mean in our given sample:

Hypothesis testing about µX when σ is unknown
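A minimal sketch of a two-tailed t-test for µX when σ is unknown, using the standard statistic t = (x̄ − µ0) / (s / √n) with n − 1 degrees of freedom. The sample values and µ0 are hypothetical; the critical value 2.262 is the usual table entry for a two-tailed test at α = .05 with 9 degrees of freedom.

```python
import math
import statistics

# Hypothetical sample of n = 10 observations, assumed drawn from a normal population
sample = [52, 55, 49, 51, 58, 54, 50, 53, 56, 52]
mu_0 = 50                       # value of the mean under H0 (hypothetical)

n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)    # sample standard deviation (n - 1 divisor)

# t statistic: sigma is replaced by s, so compare with the t distribution, not the Z
t_stat = (xbar - mu_0) / (s / math.sqrt(n))

# Two-tailed critical value for alpha = .05 and n - 1 = 9 degrees of freedom,
# read from a t table
T_CRITICAL = 2.262
reject_h0 = abs(t_stat) > T_CRITICAL

print(xbar, s, t_stat, reject_h0)
```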

Interval Estimation: A review

- Point estimators produce a single estimate of the parameter of interest
- In many real-world situations, some notion of the margin of error would be useful
- Interval estimators produce an interval, i.e. a range of values, and a degree of confidence associated with that interval
1. Hence the name confidence interval
2. “How often would you expect the true population parameter to be in this (sample-specific) interval?”

CI for the population mean when σ is known

- CIs for means and proportions typically have a similar structure
1. Centred at the sample statistic
2. Endpoints are ± some multiple of the standard error (if we don’t know sigma) or standard deviation (if we do know sigma) of the sampling distribution
3. The multiple is determined by the confidence level chosen by the investigator
4. Remember: if you don’t know sigma and have a small sample, use the t-distribution tables to get your bounds, not the Z!
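The structure above (centre ± multiple × standard deviation of the sampling distribution) can be sketched for the case where σ is known, so that the multiple comes from the standard normal distribution. The function name and all the input numbers are hypothetical.

```python
import math
from statistics import NormalDist


def z_confidence_interval(xbar, sigma, n, confidence):
    """Confidence interval for the population mean when sigma is known."""
    # The multiple is set by the confidence level, e.g. about 1.96 for 95%
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin


# Hypothetical inputs: sample mean 53, known sigma 3, sample size 36
lower, upper = z_confidence_interval(xbar=53.0, sigma=3.0, n=36, confidence=0.95)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

For a small sample with σ unknown, the same structure applies with s in place of σ and a t-table multiple in place of the normal one.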

Finite Sample Inference about µX when σ unknown

T probability Table

Law of Large Numbers Theorem

Statistical Accuracy of Hypothesis Testing

Hypothesis testing examples and concepts

Null Hypothesis and Alternative Hypothesis

How data are used to test a null hypothesis

A. Proceed by comparing a test statistic with the value specified in H0 and decide whether the difference is:
- Small enough to be attributable to random sampling errors → do not reject H0, or
- So large that H0 is “more likely” not to be correct → reject H0
B. Formally define a rejection (or critical) region
- Values of the test statistic that are so extreme they lead us to reject H0 in favour of H1
- Other values of the test statistic that are not so extreme lie in the non-critical region

Example 1: Quality Control at McDonalds

P-Value Method

Example 2: Students Progress

- Student outcomes are potentially influenced by:
1. Student effort
2. Students’ innate academic capacity
3. Quality of teaching staff and resources
4. Quality of fellow students (peer effects)
- Suppose we take the results of a quiz last term as an indicator of student quality that term
1. (This assumption may be flawed, but measurement of student quality, and many other things, is difficult)
2. Past results from the same quiz over several years are taken as the relevant population distribution

Hypothesis Testing: A note about types of errors

- The first thing to remember is that we will never know what the true population parameter is
- As a result, it is possible to make an error, because there is always an element of uncertainty in all of this
- Loosely speaking, there are two types of errors we can make:
1. Type I Error: conclude that the alternative hypothesis is true (reject H0) when it is in actuality false
2. Type II Error: conclude that the alternative hypothesis is false (do not reject H0) when it is in actuality true
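As a sketch of how the two error probabilities relate, the code below computes the probability of a Type II error (β) and the power (1 − β) for an upper-tailed z-test with known σ. The function name and all numbers (µ0, the assumed true mean, σ, n) are hypothetical.

```python
import math
from statistics import NormalDist


def type_ii_error(mu_0, mu_true, sigma, n, alpha):
    """P(do not reject H0: mu = mu_0) in an upper-tailed z-test when the true mean is mu_true."""
    standard_error = sigma / math.sqrt(n)
    # Reject H0 when the sample mean exceeds this critical value;
    # alpha is the probability of a Type I error
    critical_xbar = mu_0 + NormalDist().inv_cdf(1 - alpha) * standard_error
    # Under the true mean, the sample mean is Normal(mu_true, standard_error)
    return NormalDist(mu_true, standard_error).cdf(critical_xbar)


alpha = 0.05
beta = type_ii_error(mu_0=50, mu_true=52, sigma=6, n=36, alpha=alpha)
power = 1 - beta
print(f"alpha = {alpha}, beta = {beta:.4f}, power = {power:.4f}")
```

For a fixed sample size, lowering α moves the critical value further out and so raises β; a larger n reduces β for the same α.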

Type I Errors

Type II Errors

Calculating Probability of Type II Errors

Power of a Test

Large Sample Inference for Nominal Data

Chi-Square Goodness of Fit

Objective

- The chi-square goodness of fit test is used to test whether a given distribution is similar to a specified distribution (such as the binomial distribution or normal distribution). It can also be used to test a hypothesis about a population proportion

Idea

- The chi-square goodness of fit test compares the expected, or theoretical, frequencies of categories from a population distribution with the observed, or actual, frequencies from a sample distribution to determine whether there is a significant difference between what is expected and what is observed.

- As we’ve seen, data often occur in nominal (categorical) form, e.g.
1. Private health insurance status and hospital type
2. Preferences elicited in customer satisfaction surveys
- Such data feature several possible outcomes or categories for the measured phenomenon
1. Categories are mutually exclusive and exhaustive
2. If we think of each respondent/observation as being a trial, this is like a multinomial extension to binomial experiments (like tossing a die)
3. One could imagine an expected or hypothesised distribution of outcomes across the categories, to which we can compare the distribution seen in our sample

- To compare the observed and expected distributions…
1. We could simply calculate differences in expected and observed category frequencies
2. Our inference problem is to determine whether these differences are statistically large enough to reject the claim that the expected (probability) distribution is, in fact, what the sample data were drawn from
- The chi-square goodness of fit test is used to test the null hypothesis that the observed and expected distributions are the same
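The comparison of observed and expected frequencies is usually summarised by the standard chi-square statistic, the sum over categories of (observed − expected)² / expected, with k − 1 degrees of freedom when the hypothesised probabilities are fully specified. The counts and hypothesised proportions below are made up for illustration and are not the customer satisfaction data; 5.991 is the usual table value for α = .05 with 2 degrees of freedom.

```python
def chi_square_statistic(observed, expected_probs):
    """Sum of (observed - expected)^2 / expected over all categories."""
    n = sum(observed)
    expected = [n * p for p in expected_probs]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical survey counts for three mutually exclusive, exhaustive categories
observed = [30, 50, 20]
# Hypothesised (benchmark) distribution under H0
expected_probs = [0.25, 0.50, 0.25]

chi_sq = chi_square_statistic(observed, expected_probs)
degrees_of_freedom = len(observed) - 1

# Upper-tail critical value for alpha = .05 and 2 degrees of freedom, from a chi-square table
CHI_SQ_CRITICAL = 5.991
reject_h0 = chi_sq > CHI_SQ_CRITICAL

print(chi_sq, degrees_of_freedom, reject_h0)
```

The rejection region lies only in the upper tail, because large values of the statistic signal large gaps between observed and expected frequencies.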


Example: Benchmarking customer satisfaction

Step 1: The null and alternative hypotheses

The Test Statistic

The Test Statistic and Its distribution

The χ² distribution (the chi-square distribution)

- The value of χ² can never be negative
- As with the t-distribution, the chi-square distribution belongs to a family of distributions, with every distribution uniquely defined by its degrees of freedom (v)
- The chi-square distribution is considerably skewed to the right (i.e. it has a strong positive skew) at low values of v
- As v increases, the chi-square distribution becomes more symmetric and closer to a normal distribution in shape, as with the t distribution

χ² probability distribution

Example: Benchmarking Customers Satisfaction (Cont.)

Contingency Table: Review

- Recall our private health insurance (PHI) and mode of transport examples
1. The survey data were summarised in a 2-way cross-tabulation or contingency table
2. In the first example, the two ways were PHI status and admission to hospital
a. PHI status had 2 levels (have PHI, don’t have PHI)
b. Admission had 3 levels (not admitted/admitted as private/admitted as public)
3. In the second example, the two ways were mode of transport and gender
a. Mode of transport had 5 categories
b. Gender had 2 categories
4. Previously, we used such tables as descriptive tools
5. We also were interested in whether the two events were independent
6. Now we want to formally test whether these random variables are independent or not independent

Q: Is there a statistically significant relationship between these two categorical random variables?

Contingency Table: New take!

- The testing strategy here is similar to that used for the goodness of fit test
1. We compare observed cell frequencies in our sample with those expected under the null hypothesis of independence
- How do you calculate the expected frequencies?
1. Previously, these followed readily from the hypothesised probability distribution
2. Now H0 simply asserts “independence” (or “homogeneity”, if the categories used for one or both dimensions are not comprehensive) of the event described by one probability distribution with respect to the other

The χ² test for independence

- To craft our null hypothesis, we thus set up an imaginary contingency table which assumes independence between the two aspects...
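Under independence, the expected frequency in each cell is the standard (row total × column total) / n, and the test has (rows − 1) × (columns − 1) degrees of freedom. The sketch below builds that imaginary table for a 2 × 3 layout like the PHI example; the cell counts are invented, not the survey data, and 5.991 is the usual table value for α = .05 with 2 degrees of freedom.

```python
# Hypothetical 2 x 3 contingency table of observed counts
# Rows: have PHI / don't have PHI
# Columns: not admitted / admitted as private / admitted as public
observed = [
    [20, 30, 50],
    [30, 30, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# The imaginary table under H0 (independence): row total * column total / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail critical value for alpha = .05 and 2 degrees of freedom, from a chi-square table
CHI_SQ_CRITICAL = 5.991
reject_independence = chi_sq > CHI_SQ_CRITICAL

print(expected, chi_sq, degrees_of_freedom, reject_independence)
```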

