ECON1203 - Summary Business and Economic Statistics PDF

Title ECON1203 - Summary Business and Economic Statistics
Course Business and Economic Statistics
Institution University of New South Wales



Description

Part 1 - Exploring and Collecting Data (Chapters 1 to 4)

Types of Data
● Variable: A characteristic of a population or of a sample from a population
  ○ A data set contains observations on variables
● Variables may be:
  ○ Quantitative (numerical) or qualitative (nominal or ordinal)
    ■ Quantitative: A variable whose values are numbers that measure an amount. EG: Exam scores and time
    ■ Qualitative: A variable whose values are categories rather than numbers. Gender is a nominal qualitative variable
  ○ Discrete or continuous
    ■ Discrete: A variable whose possible values can be counted or listed. All qualitative variables are discrete, and so are quantitative variables that are counts. EG: Football scores (1:0, 2:1, etc)
    ■ Continuous: A variable that can take any value in an interval, so it has infinitely many possible values and can be measured to ever finer decimal places. EG: Time remaining in a football game (1:00, 1:01, 1:02, and every instant in between)
  ○ Ordinal qualitative data feature a natural ordering (the values of the categorical variable have an intrinsic order)
    ■ EG: Grades (A+ is greater than B)
    ■ EG: Course evaluations (poor, average, good)
  ○ Nominal data, on the other hand, come from a categorical variable with unordered categories (the categories cannot be ranked, i.e. one is not definitively better than another)
    ■ EG: Men, women (Gender)

Types of Observations
● Time series data consist of measurements of the same concept at different points in time
  ○ EG: Sydney-area births per day for each day in a year
● Cross sectional data consist of measurements of one or more concepts at a single point in time
  ○ EG: Age, gender and marital status of a sample of UNSW staff in a particular year

Describing “Big Data”
● [from Wikipedia] Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying and information privacy.

Summarising a Categorical Variable
● To make sense of large data sets and to organise and communicate them easily, firms can use frequency tables. A frequency table records the count for each category. Some tables also show the relative frequency (the count as a percentage of the total).

Descriptive Statistics 1: Frequency Distributions
● Summaries of categorical data using counts
  ○ EG: UNSW is interested in how students get to campus, for long-term planning
  ○ Note that categories need to be mutually exclusive and exhaustive
  ○ Mutually Exclusive: Each observation falls into at most one category
  ○ Exhaustive: Each observation falls into at least one category
● Bar charts and pie charts graphically represent frequency distributions
  ○ A bar chart displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison.
  ○ A pie chart shows how a whole group breaks into several categories. It shows all the cases as a circle sliced into pieces whose areas are proportional to the fraction of cases in each category.

Contingency Tables and Conditional Distribution
● A contingency table shows the joint frequency distribution of two categorical variables, so that any association between them can be seen. The frequency distributions in the margins (marginal distributions) show the counts for each variable on its own.
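As a rough illustration of the frequency tables described above, the sketch below counts a categorical variable and converts the counts to relative frequencies. The survey responses and the helper name `frequency_table` are made up for this example and are not from the course material.

```python
from collections import Counter

# Hypothetical survey responses: how each student gets to campus.
responses = ["bus", "train", "bus", "walk", "car", "bus", "train", "walk"]

def frequency_table(values):
    """Return {category: (count, relative_frequency)} for a categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: (count, count / total) for category, count in counts.items()}

table = frequency_table(responses)
for category, (count, rel) in sorted(table.items()):
    # One row per category: label, count, relative frequency as a percentage.
    print(f"{category:<6} {count:>3} {rel:>6.1%}")
```

Because every response falls into exactly one category (mutually exclusive and exhaustive), the relative frequencies add up to 100%.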

Descriptive Statistics 2: Histograms
● For quantitative variables, there are no ready-made categories. Instead, values are grouped into bins, and the height of each bar shows how many values fall in that bin. Note that histograms should not have gaps between bins, as a gap could mislead the reader into thinking that there are holes in the data.
● Suppose data are quantitative (whether discrete or continuous)
  ○ Obvious categories for the data values may not exist
  ○ But we can create categories or classes by defining lower and upper class limits (these categories must be mutually exclusive and exhaustive)
● EG:





  ○ Categories are: 0-49, 50-64, 65-74, 75-84, 85-100
● Note that:
  ○ Gaps between bars don’t look great in a histogram
  ○ The default bin labels can be confusing (mid points are preferable to upper or lower limits)
  ○ Bar areas should be proportional to frequencies
  ○ EG (Bad-looking histogram):

We can convert the information contained in frequency distributions into: ○ Cumulative frequency or relative frequency distributions ■ Aussie marks: How many students got a credit or better? ■ Associated cumulative histograms
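The binning and cumulative counts described above can be sketched as follows. The marks are hypothetical whole numbers, the class limits are the five categories listed in the notes, and treating "credit or better" as a mark of 65 or more is an assumption of this example.

```python
# Hypothetical marks out of 100 (whole numbers); class limits follow the notes.
marks = [35, 48, 52, 58, 61, 66, 70, 73, 78, 81, 88, 93]
bins = [("0-49", 0, 49), ("50-64", 50, 64), ("65-74", 65, 74),
        ("75-84", 75, 84), ("85-100", 85, 100)]

def bin_counts(values, class_limits):
    """Count how many values fall in each class (both limits inclusive)."""
    return [(label, sum(lo <= v <= hi for v in values))
            for label, lo, hi in class_limits]

def cumulative(counts):
    """Running total of the class counts, from the lowest class upward."""
    running, out = 0, []
    for label, count in counts:
        running += count
        out.append((label, running))
    return out

counts = bin_counts(marks, bins)
cum = cumulative(counts)

# Assumption: "credit or better" means 65 or more, i.e. everyone who is
# not counted in the cumulative total up to the 50-64 class.
credit_or_better = len(marks) - cum[1][1]
print(counts, cum, credit_or_better)
```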



Stem and leaf displays
  ○ Similar to a histogram, the length of each row describes the distribution, but the individual values are also shown.
  ○ EG:

Describing Histograms
● When describing a histogram, three things are important: its shape, its centre and its spread.

Shape
● The shape of data can be described through:
  ○ Mode: The most frequent score. In a histogram, modes are the humps in the distribution. (A histogram can be unimodal, bimodal or multimodal)
  ○ Symmetry: A distribution is symmetric if the halves on either side of the centre look, at least approximately, like mirror images.
    ■ The thinner ends of a distribution are called tails. A distribution with one tail longer than the other is skewed. (Long tail to the right = Positively skewed, Long tail to the left = Negatively skewed)
  ○ Outliers: Values that stand off from the rest of the distribution. Outliers can affect almost every method and must be looked out for.

Centre
● The centre refers to the most typical score in the distribution. This can be measured by looking at the mean and median.
  ○ The mean is the average of all scores.
    ■ It is given by the following formula: $\bar{x} = \frac{\sum x}{n}$ (Note that sigma, $\sum$, means sum)
    ■ The formula shows that to find the mean, we add up all values of the variable, x, and divide that total by the number of data values, n.
  ○ The median is the middle value: it splits the histogram into two equal areas. (Make sure to place the values in ascending order)
    ■ When the number of values is odd, the median will be a particular value (the middle score)
    ■ When the number of values is even, the median will be the average of the two middle values.
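The two measures of centre can be sketched in a few lines. The data values are hypothetical and were chosen so that one large outlier shows how differently the mean and the median react.

```python
def mean(values):
    """Add up all values and divide by the number of values."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical scores with one outlier (100).
scores = [1, 2, 3, 4, 100]
print(mean(scores), median(scores))
```

Here the outlier pulls the mean well above most of the scores while the median stays at the middle value.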



Spread
● Spread measures how varied or similar values are in a distribution. There are four measures of spread:
  ○ Range: The largest score minus the smallest score. It is a single number, not an interval of values, and can be greatly affected by outliers.
  ○ Interquartile Range: The interquartile range summarises the spread by focusing on the middle half of the data. It is defined as the difference between the upper and lower quartiles: IQR = Q3 - Q1. EG:



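  ○ Standard Deviation: The standard deviation measures how far scores typically lie from the mean: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

  ○ Variance: Measures the average squared distance from the mean. It is the standard deviation squared: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$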
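A minimal sketch of the four measures of spread, using Python's standard `statistics` module on a hypothetical sample. Quartile conventions differ between textbooks and software, so the IQR below reflects the module's default method and may not match a hand calculation from the course exactly.

```python
import statistics

# Hypothetical sample of eight scores.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(scores) - min(scores)      # largest minus smallest

# Quartiles via the default ("exclusive") method of statistics.quantiles.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1                                # spread of the middle half

variance = statistics.variance(scores)       # sample variance (divides by n - 1)
std_dev = statistics.stdev(scores)           # square root of the sample variance

print(value_range, iqr, round(variance, 3), round(std_dev, 3))
```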

What Should You Report
● If the shape is skewed, report the median and interquartile range; if it is unimodal and symmetric, use the mean and standard deviation (if there are outliers, calculate the summaries with and without the outliers and note the difference).
● For unimodal symmetric data, the IQR is usually a bit larger than the standard deviation.
● Always pair the median with the IQR and the mean with the standard deviation. It is not useful to report a measure of centre without a corresponding measure of spread.

Standardising Variables
● Standardising allows for the comparison of two scores from different distributions. This is done with the z-score, which gives the number of standard deviations a score lies away from the mean.
● The formula is as follows: $z = \frac{x - \bar{x}}{s}$
● The z-score is the observed score minus the mean, divided by the standard deviation.

EG:
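A small sketch of standardising, comparing two scores from different distributions. The courses, means and standard deviations are invented for illustration only.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical results: course A has mean 70 and SD 5, course B has mean 75 and SD 10.
z_a = z_score(80, mean=70, sd=5)    # a mark of 80 in course A
z_b = z_score(85, mean=75, sd=10)   # a mark of 85 in course B
print(z_a, z_b)
```

Although 85 is the higher raw mark, the 80 in course A sits further above its own mean in standard deviation terms, so it is the stronger result relative to its distribution.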

Five-Number Summary and Box Plots
● The five-number summary of a distribution displays its median, 1st and 3rd quartiles, and extremes (min and max).
● A box plot is a graphical representation of the five-number summary. The orange box spans the interquartile range, the whiskers extend to the most extreme scores not considered outliers, and the black horizontal line is the median.

Outliers
● Outliers are scores that are distant from the rest.
● Outliers must be considered in context: some are genuine extreme values, while others are mistakes caused by incorrect units, misplaced decimals, etc.
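The five-number summary can be computed with a short helper. This is a sketch on hypothetical data; the function name `five_number_summary` is made up here, and the quartiles again depend on the convention used by `statistics.quantiles`.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # Quartile conventions vary; this uses the module's default method.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return (min(values), q1, statistics.median(values), q3, max(values))

# Hypothetical sample of eight scores.
summary = five_number_summary([2, 4, 4, 4, 5, 5, 7, 9])
print(summary)
```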

Time Series Plots
● A time series plot is a display of values against time and can be used to show patterns over time. The series can be smoothed to reveal underlying trends.
● In other words, it plots a bivariate relationship between some variable and time
● EG: Petrol Prices

Key features: ● Upward trend with increasing volatility ● Spike in 1990 - Gulf war ● Gap in data in 1997? ● Sydney and Melbourne prices move together whilst Brisbane prices are generally lower, especially in later periods



EG 2: Petrol Prices (daily price movements in Sydney, winter 2006)



Key features:
    ■ Notable price variation from day to day
    ■ From these data, we can determine the days of the weekly peak and trough
    ■ The common day for prices to peak was Thursday, whilst troughs occurred on Tuesday
    ■ Other days did not have such peaks or troughs

Scatterplots
● Graphs which represent the relationship between two quantitative variables. When examining a scatterplot, the direction of the relationship can be read from the slope of the pattern.
  ○ If the points slope upward (positive gradient), there is a positive linear relationship
  ○ If the points slope downward (negative gradient), there is a negative linear relationship
  ○ Another thing to examine is strength. If the points are tightly clustered in a single stream, the association is strong, whereas if they are highly spread out with no pattern, the association is weak.
  ○ Pay attention to outliers
● The x-axis runs from side to side (horizontal), while the y-axis runs up and down (vertical). The x-axis plays the role of the explanatory or predictor variable while the y-axis plays the role of the response variable.

Association
● Association describes the relationship between the x and y values of the variables, i.e. do large values of x tend to be associated with large values of y?
● It can be measured through covariance, which is given by the formula: $s_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$
● A positive covariance indicates a positive linear relationship. However, covariance is not scale free and depends on the units of measurement used.
  ○ I.e. covariance between height and weight depends on the units of measurement of each variable
● To get a standardised version, correlation is used.

Correlation
● Correlation is a number between -1 and 1 that measures the strength and direction of the linear relationship, with -1 and 1 being perfect relationships (-1 being a perfect negative relationship and 1 being a perfect positive relationship).
● If it is close to -1 or 1, it indicates a strong linear relationship; if it is close to 0, it indicates a weak linear relationship.
● It is represented by the letter “r”: $r = \frac{s_{xy}}{s_x s_y}$
  ○ Covariance divided by the product of the SD of x and the SD of y
● Before using correlation, three conditions need to be checked: the variables must be quantitative, the relationship must be linear (the scatterplot should look roughly straight, not curved), and if outliers exist, the correlation should be reported with and without them.
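The sketch below computes covariance and correlation directly from their definitions and illustrates why covariance is not scale free. The height and weight values are hypothetical, and the function names are chosen for this example only.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def std_dev(values):
    """Sample standard deviation (the covariance of a variable with itself is its variance)."""
    return sqrt(covariance(values, values))

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))

# Hypothetical data: the same heights recorded in centimetres and in metres.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 58, 69, 77, 91]
heights_m = [h / 100 for h in heights_cm]

cov_cm = covariance(heights_cm, weights_kg)
cov_m = covariance(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)
r_m = correlation(heights_m, weights_kg)
print(cov_cm, cov_m, r_cm, r_m)
```

Changing the units from centimetres to metres shrinks the covariance by a factor of 100 but leaves the correlation unchanged, which is the sense in which correlation is the standardised version.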

Lurking Variable and Causation
● When examining correlation, it is imperative not to infer causation from correlation alone, mainly because of lurking variables. A lurking variable is a third variable that affects both of the other two variables, so a strong correlation between them need not reflect a causal link.
● EG: Research shows a strong association between height and reading ability among elementary school students. However, it does not necessarily mean that being taller causes better reading ability. A lurking variable exists here: the kids’ ages. Older kids tend to be taller and also tend to have better reading skills. Therefore, the association between height and reading ability should not be mistaken for causality.

Linear Model
● If there is a discernible pattern then there is association
● We can model the relationship with a line and give its equation
● A linear model is a straight line through the data on a scatterplot. The points don’t all line up, but a straight line can summarise the general pattern with only a few parameters.
● Residuals: The difference between the observed and predicted values.
  ○ A residual tells us how far the model’s prediction is from the observed value at that point.





Any linear model can be written in the form $\hat{y} = b_0 + b_1 x$, where b0 and b1 are numbers estimated from the data and $\hat{y}$ is the predicted value. We use the ^ to distinguish the predicted value from the observed value y. The difference between the observed value, y, and the predicted value, $\hat{y}$, is called the residual and is denoted e: $e = y - \hat{y}$.

The line of best fit (regression line)
  ○ Some residuals will be positive and some negative
  ○ The smaller the sum of the squares of the residuals, the better the fit.
  ○ The line of best fit is the line for which the sum of the squared residuals is smallest, often called the least squares line.

Correlation and the Line
● Linear models are written using the equation y-hat = b0 + b1x.
  ○ b1 refers to the slope of the line and it will have the same sign as the covariance.
    ■ $b_1 = r \frac{s_y}{s_x}$
    ■ Correlation multiplied by the ratio of the SD of y to the SD of x
    ■ The slope tells us that for every 1 unit increase in x, the predicted y changes by b1 units.
  ○ b0 is the intercept and is found using the equation below:
    ■ $b_0 = \bar{y} - b_1 \bar{x}$
    ■ Mean of y minus the product of the slope and the mean of x
    ■ The intercept is the value of the line when the x variable is zero.
  ○ Any difference between the actual value and the prediction from the line is known as the residual.
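The slope and intercept rules above (slope = correlation times the ratio of the SD of y to the SD of x, intercept = mean of y minus slope times mean of x) can be sketched as follows. The data points and the helper name `fit_line` are hypothetical.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    r = s_xy / (s_x * s_y)        # correlation
    b1 = r * s_y / s_x            # slope
    b0 = y_bar - b1 * x_bar       # intercept
    return b0, b1

# Hypothetical data points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
predicted = [b0 + b1 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]  # observed minus predicted
print(b0, b1, residuals)
```

Some residuals come out positive and some negative, and for the least squares line they sum to zero (up to rounding).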

Regression to the Mean
● If x is 2 SDs above its mean, the predicted y is never more than 2 SDs from its own mean, since r can’t be bigger than 1 in size
● So each predicted y tends to be closer to its mean (in standard deviations) than its corresponding x was.
● This property of the linear model is called regression to the mean. This is why the line is called the regression line.
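One way to see this property is to rewrite the line in standard deviation units, using the slope and intercept rules from the notes (slope = r times SD of y over SD of x, intercept = mean of y minus slope times mean of x). The derivation below is a sketch added for clarity, not part of the original notes.

```latex
\begin{align*}
\hat{y} &= b_0 + b_1 x, \qquad b_0 = \bar{y} - b_1 \bar{x} \\
\hat{y} - \bar{y} &= b_1 (x - \bar{x}) = r \, \frac{s_y}{s_x} \, (x - \bar{x}) \\
\frac{\hat{y} - \bar{y}}{s_y} &= r \, \frac{x - \bar{x}}{s_x}
\end{align*}
```

The left-hand side is the predicted y in standard deviation units and the right-hand side is r times the z-score of x. Since r lies between -1 and 1, the predicted y is never further from its mean than x is from its own. For instance, with a hypothetical r of 0.5, an x that is 2 SDs above its mean gives a predicted y only 1 SD above its mean.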

Checking the Model
● Quantitative Variables Condition
● Linearity Condition
● Outlier Condition (no point needs special attention; outlying points can dramatically change a regression model).

Variation in the Model and R² (Pronounced R squared)
● If the linear model were perfect, the residuals would all be zero and would have a standard deviation of 0.
● The squared correlation R² gives the fraction of the data’s variation accounted for by the model.
● Because R² is a fraction of a whole, it is usually given as a percentage.
● The value is always between 0% and 100%.
● R² of 100% is a perfect fit, with no scatter around the line.
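As a rough check of the idea that R² is the fraction of variation accounted for by the model, the sketch below compares the variation left in the residuals with the total variation in y. The data are hypothetical, and the slope is computed as s_xy / s_x², which is algebraically the same as r times the ratio of the SDs.

```python
def mean(values):
    return sum(values) / len(values)

def r_squared(xs, ys):
    """Fraction of the variation in y accounted for by the least squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx                      # least squares slope
    b0 = y_bar - b1 * x_bar               # least squares intercept
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # residual variation
    sst = sum((y - y_bar) ** 2 for y in ys)                      # total variation in y
    return 1 - sse / sst

# Hypothetical data points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r2 = r_squared(xs, ys)
print(f"{r2:.0%}")
```

For data that lie exactly on a line the residuals are all zero, so the function returns 1 (that is, 100%).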

Part 2 - Modelling with Probability (Chapters 5 to 9)

Random Phenomena and Probability
● When examining random phenomena, each attempt is known as a trial and the result each trial generates is known as the outcome.
  ○ A combination of outcomes is known as an event (EG: For the event of rolling at least a 4 on a die, the outcomes are 4, 5, 6)
  ○ The set of all possible outcomes of a trial is known as the sample space, written S (the sample space for tossing two coins is HH, HT, TH, TT)
● The probability of an event is its long-run relative frequency (i.e. the proportion of trials in which it occurs)
● Independence means that the outcome of one trial does not influence or change the outcome of another (i.e. when one occurs, it does not affect the probability of the other occurring)
  ○ For independent trials, the relative frequency of any outcome eventually gets closer to one value, that value being the probability. This idea is the law of large numbers (LLN).
  ○ This idea can be wrongly interpreted as meaning that something is bound to go up if it has been down for a long time. However, the outcomes only even out in the long run, so it is impossible to predict the short term.
  ○ This is because the outcomes are independent of each other.
● Empirical Probability: The relative frequency of an event’s occurrence, used as an estimate of the probability of the event itself.

Different Types of Probability
● Theoretical Probability: If each outcome is equally likely, then we can find the theoretical probability by dividing the number of outcomes in the event by the total number of possible outcomes.
  ○ P(A) = # of outcomes in A / Total # of outcomes possible
● Personal Probability: A probability derived from an individual’s personal judgement about whether a specific outcome is likely to occur. It contains no formal calculations and only reflects the subject’s opinion and past experience.

Probability Rules
● Rule 1: If the probability of an event is 0, the event won’t occur. However, if the probability of an event is 1, the event will always occur.
  ○ The probability of an event cannot be negative or greater than 1.
  ○ For any event A, 0...

