Week1 notes - ECON 3640 PDF

Title Week1 notes - ECON 3640
Author Thomas Yassmin
Course Actuarial Math
Institution University of Utah
Pages 8


ECON 3640-090 Probability & Statistical Inference Lecture notes - Week 1

Fall 2019 Márcio Santetti

1 What is statistics?

Statistics is the practice of collecting and analyzing data, either quantitative or qualitative, with the purpose of describing, measuring, and inferring relevant information from it. As an example, suppose you are curious about how students performed in this class in previous semesters. If I say that the average mark from the two previous terms was 85/100, what would you conclude from it? Regardless of whether you consider this mark high or low, or whether you take it to imply an easy or a difficult class, you are already doing Statistics! What matters is that you are taking a piece of information (in this case, numerical) and drawing conclusions for this semester based on previous data. This is the essence of Statistics: transforming raw numbers/data/information into accessible conclusions about diverse populations of interest.

During this semester, we will concentrate our studies on three main blocks: (i) descriptive statistics, (ii) probability theory, and (iii) statistical inference. Let us start with the first.

2 Descriptive statistics

Descriptive statistics involves the methods of organizing, summarizing, and presenting data in a convenient and informative way. For some cases, it may be more interesting to present data graphically, such as data for the US GDP series from 1950 to 2019; for others, a numerical presentation suffices, as for the average grade in previous semesters for this class. For some other situations, a table may be more interesting, as when we have 50 students in a classroom and want to gather some data about them, such as their age, major, and high school GPA. It is up to the practitioner to decide which descriptive technique is the most convenient to address the problem at hand. We will learn the most relevant ones in this part of the course.

It is important to note at this point that descriptive measures are just a means to an end. The goal of presenting a numerical measure, a graph, or a table should be to provide information such that a statistics practitioner can draw some conclusions about the sample or population that is being analyzed. Extending the interpretation of descriptive measures to entire populations is the purpose of Statistical Inference, which combines descriptive techniques with probability rules in order to draw conclusions (that is, inferences) about characteristics of populations based on sample data.


3 Some key statistical concepts

To start off our studies, let us first clearly define two of the most important statistical concepts: population and sample.

3.1 Population

A population is the group of all items of interest to a statistics practitioner. It does not need to be a population of people: it can be a population of plants, firms, rocks, houses, cats, etc. A descriptive measure of a population is called a parameter.

Now, consider the following situation: you are interested in how many hours UofU students spend at the library per week. Are you able to get information from all students? It would be very costly (especially in terms of time) to gather data from all of them. However, we can still draw precise and interesting conclusions about these students by using a good sample of students.

3.2 Sample

A sample is a set of data drawn from the studied population. A descriptive measure of a sample is called a statistic. It is used to make inferences about population parameters. From our previous example, if we are able to get a good sample (both quantitatively and qualitatively) of UofU students, our statistical analysis tends to be very precise.

In statistical practice, it is almost impossible to work with entire populations. The general practice is to collect samples and then use inferential techniques to draw conclusions about the overall population, without incurring all the costs of gathering data from every single member of the population. Usually, a fraction of these data is enough. This is one of the most powerful features of Statistics.


4 Types of data and information

In Statistics, a variable is a characteristic of a population or sample. There are three main types of data in Statistics: (i) interval, (ii) nominal, and (iii) ordinal data.

4.1 Interval data

Interval data are also known as quantitative or numerical data. As an example, consider the following midterm grades:

67 72 89 92 67 21 100 99 88

For this type of data, there is no need to put the observations in order. Moreover, all descriptive calculations (which we will see in more detail soon) are possible with interval data. As a quick exercise, calculate the average mark from the above data set.
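One possible way to check the quick exercise is sketched below in Python. It only uses the nine grades listed above; the variable names (`grades`, `average`) are illustrative choices, not part of the original notes.

```python
from statistics import mean

# Midterm grades copied from the notes above.
grades = [67, 72, 89, 92, 67, 21, 100, 99, 88]

# The average is the sum of the observations divided by their count.
average = sum(grades) / len(grades)

# The standard library's mean() should agree with the manual calculation.
assert abs(average - mean(grades)) < 1e-9

print(f"Sum = {sum(grades)}, n = {len(grades)}, average = {average:.2f}")
```

The grades add up to 695 over 9 observations, so the average mark is roughly 77.2.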

4.2 Nominal data

Nominal data comprise qualitative or categorical variables. As an example, we can consider the following marital statuses:

Single Married Divorced Widowed

We can also assign number codes to each category:

Single=1 Married=2 Divorced=3 Widowed=4

Here, neither the magnitudes nor the ordering of the codes matter. Therefore, any other numbering system works for nominal data. If this were not the case, the example would imply that being widowed is better than being divorced, which is not its point.

For this type of data, numerical descriptive techniques do not make sense. As a quick check, take the average of the marital status codes above and try to give some meaning to it. However, we can at least calculate and report relative frequencies (%). As an exercise, suppose that, from a sample of 7 individuals, 2 are single, 3 are married, 1 is divorced, and 1 is widowed. Report the relative frequencies of each marital status for this sample.
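The exercise above can be worked through with a short Python sketch. It assumes the sample of 7 individuals exactly as described in the text; the names `sample`, `relative_freq`, and `mean_code` are illustrative only.

```python
from collections import Counter

# The sample from the exercise: 2 single, 3 married, 1 divorced, 1 widowed.
sample = ["Single"] * 2 + ["Married"] * 3 + ["Divorced"] + ["Widowed"]

counts = Counter(sample)
n = len(sample)

# Relative frequency (%) = 100 * (count in the category) / (sample size).
relative_freq = {status: 100 * count / n for status, count in counts.items()}

for status, pct in relative_freq.items():
    print(f"{status:<9} {counts[status]}  {pct:.2f}%")

# Averaging the codes is possible arithmetically, but the result has no meaning:
# 2.5 does not correspond to any marital status.
codes = {"Single": 1, "Married": 2, "Divorced": 3, "Widowed": 4}
mean_code = sum(codes.values()) / len(codes)
print(f"Average of the codes: {mean_code}")
```

The relative frequencies come out to about 28.57% single, 42.86% married, and 14.29% each for divorced and widowed, while the average code of 2.5 falls between Married and Divorced and says nothing about the sample.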


4.3 Ordinal data

As the name implies, for ordinal data the order actually matters. The numbering procedure is the same as for nominal data, but now the codes express an order of preference/strength. As an example, consider the following data for the evaluation of a service:

Poor=1 Fair=2 Good=3 Very Good=4 Excellent=5

In this case, an Excellent evaluation is better than a Very Good one. Therefore, coding matters for this type of data. Any other numbering system would also work here, as long as the order remains unchanged. For ordinal data, ranking the categories (in ascending or descending order) is a useful way to describe them. Moreover, the median (which we will study soon) may be a useful descriptive technique.
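As a rough illustration of why the median suits ordinal data, the Python sketch below finds the middle evaluation of a small set of ratings and then repeats the calculation under a different, order-preserving coding. The ratings and the alternative codes (10 to 50) are hypothetical, made up for this illustration, and do not come from the notes.

```python
from statistics import median_low

# Codes from the notes: Poor=1 ... Excellent=5.
labels = {1: "Poor", 2: "Fair", 3: "Good", 4: "Very Good", 5: "Excellent"}

# Hypothetical ratings, invented for illustration only (not from the notes).
ratings = [3, 5, 4, 2, 4, 1, 5, 4, 3]

# median_low always returns an actual observation, so it maps to a category.
median_code = median_low(ratings)
median_label = labels[median_code]

# A different coding that keeps the same order (an assumption for this sketch).
recode = {1: 10, 2: 20, 3: 30, 4: 40, 5: 50}
inverse = {new: old for old, new in recode.items()}

recoded_median = median_low([recode[r] for r in ratings])
same_category = labels[inverse[recoded_median]]

print(f"Median with codes 1-5:   {median_label}")
print(f"Median with codes 10-50: {same_category}")
```

Both codings point to the same category, which is the sense in which any numbering system works for ordinal data as long as the order is preserved.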

5 Graphical descriptive techniques

5.1 Nominal/ordinal data

Sometimes, a table describing nominal (or ordinal) data may not be the best option to catch a reader’s eye. Therefore, some graphical techniques can be useful, such as bar charts and pie charts. The table below describes the work status of a sample of 1,973 individuals. Note that a numerical code (from 1 to 8) is assigned to each work status, and absolute and relative frequencies are also given.

Work status          Code   Count   Relative frequency (%)
Full-time            1      912     46.22
Part-time            2      226     11.45
Temp. not working    3      40      2.03
Unemployed           4      104     5.27
Retired              5      357     18.09
In school            6      70      3.55
Housekeeping         7      210     10.64
Other                8      54      2.74
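The relative frequencies in the table can be reproduced from the counts alone. The Python sketch below is one way to do it; the dictionary name `counts` and the rounding to two decimals are choices made for this illustration.

```python
# Counts copied from the work status table above.
counts = {
    "Full-time": 912,
    "Part-time": 226,
    "Temp. not working": 40,
    "Unemployed": 104,
    "Retired": 357,
    "In school": 70,
    "Housekeeping": 210,
    "Other": 54,
}

# The sample size is the sum of the absolute frequencies.
total = sum(counts.values())

# Relative frequency (%) of each category, rounded to two decimals.
relative_freq = {status: round(100 * count / total, 2)
                 for status, count in counts.items()}

print(f"Total observations: {total}")
for status, pct in relative_freq.items():
    print(f"{status:<18} {counts[status]:>4}  {pct:>6.2f}")
```

The counts add up to the 1,973 individuals mentioned in the text, and each percentage matches the corresponding table entry.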

To illustrate the absolute frequency, that is, the total number of observations fitting each work status, a bar chart can be useful. The figure below shows it.


[Figure: Bar chart. x-axis: code; y-axis: count; legend (work_status): full-time, housekeeping, other, part-time, retired, school, temp-not-working, unemployed.]

If, on the other hand, we want to illustrate relative frequencies, that is, the percentage with which each category appears in the data set, a pie chart is the best option, as shown below for the job status example.

[Figure: Pie chart of the relative frequencies, with slices labeled by work status code (1 to 8).]

In case we are working with ordinal data, using a bar chart in ascending or descending order is a great option to describe a data set. The pie chart is used in the same way as shown above.


5.2 Interval data

One of the most common ways of presenting interval data is through a histogram. As an example, consider the following data set of clients’ ages at a restaurant during breakfast hours:

1 3 27 32 5 63 26 25 18 16 4 45 29 19 22 51 58 9 42 6

To construct a histogram, the easiest way to start is by following a recipe. These are the steps:

1. Find the data set’s lowest and highest values;
2. Define the appropriate intervals (bin size), and count the number of observations contained in each interval;
3. Draw the histogram.

From our example, the lowest value is 1, and the highest is 63. As for the second step, there is no rule for the choice of the bin size; it is up to the practitioner, depending on the type and the range of observations she has at hand. Since in this example we are dealing with ages, intervals of 10 years may be appropriate. The following table summarizes the second step. Notice that the intervals do not overlap: if one interval is 0–9, the next one must start with the next number, that is, 10–19. If two intervals contained the same observation, we would have a double-counting problem.

Interval/bin   Frequency (#)
0–9            6
10–19          3
20–29          5
30–39          1
40–49          2
50–59          2
60–69          1
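The first two steps of the recipe can be sketched in Python as follows. The sketch assumes bins of width 10 starting at 0, as in the table, and uses integer division so that every age falls into exactly one bin; the names `bin_width` and `table` are illustrative.

```python
from collections import Counter

# Clients' ages copied from the notes above.
ages = [1, 3, 27, 32, 5, 63, 26, 25, 18, 16,
        4, 45, 29, 19, 22, 51, 58, 9, 42, 6]
bin_width = 10  # assumption: 10-year bins, as chosen in the text

# Step 1: find the lowest and highest values.
lowest, highest = min(ages), max(ages)

# Step 2: integer division maps each age to exactly one bin
# (0-9 -> 0, 10-19 -> 1, ...), so no observation is counted twice.
bin_counts = Counter(age // bin_width for age in ages)

table = []
for b in range(lowest // bin_width, highest // bin_width + 1):
    lower = b * bin_width
    upper = lower + bin_width - 1
    table.append((f"{lower}-{upper}", bin_counts.get(b, 0)))

# A rough text version of step 3: one '#' per observation in the bin.
for label, freq in table:
    print(f"{label:>5}  {freq}  {'#' * freq}")
```

The resulting frequencies agree with the table, and they add up to the 20 observations in the data set, which is a quick check that no age was skipped or counted twice.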

As the second step is complete, we just need to draw the histogram. On the y-axis, we put the frequency; on the x-axis, the intervals. Notice that, in the histogram below, the bin sizes are the same, and there are no gaps between the bars.


[Figure: Histogram of ages. x-axis: ages (0 to 70); y-axis: Frequency (0 to 6).]

5.3 Graphing the relationship between two interval variables

Suppose that we want to visually describe the relationship between two interval variables, such as sales vs. advertising expenditures, wages vs. years of schooling, or stock prices vs. people’s interest. The technique to be used is the scatter diagram. As an example, let us consider the relationship between the price and the size of houses:

Size (ft²)   Price (US$ 1,000)
2,300        315
1,800        230
2,600        355
2,000        260
1,500        215

To draw the scatter plot, we must define an independent (explanatory) and a dependent (explained) variable. For our example, it makes sense to assume that the price is dependent on the size of a house, and not the other way around. Finally, to draw it, we combine each price and size (i.e., each row from the table) with a dot in their corresponding places in the plotting space. For this case, there is no need to connect the dots. The graph below describes this relationship. How would you interpret it?
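Before plotting, it can help to line up the points that will appear in the scatter diagram. The Python sketch below pairs each size (the explanatory variable, on the x-axis) with its price (the explained variable, on the y-axis) and checks whether price rises with size across the five houses. The monotonicity check is an informal aid added for this illustration, not a technique introduced in the notes.

```python
# House data copied from the table above.
sizes = [2300, 1800, 2600, 2000, 1500]   # square feet (x-axis)
prices = [315, 230, 355, 260, 215]       # US$ 1,000 (y-axis)

# Each row of the table becomes one (size, price) dot; sort by size.
points = sorted(zip(sizes, prices))

# Informal check: does every larger house in the sample cost more?
price_rises_with_size = all(
    p1 < p2 for (_, p1), (_, p2) in zip(points, points[1:])
)

for size, price in points:
    print(f"size = {size:>5}  price = {price}")
print(f"Price rises with size: {price_rises_with_size}")
```

In this small sample, every larger house has a higher price, which is consistent with reading the scatter plot as showing a positive association.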


[Figure: Price vs. size of a house. x-axis: size; y-axis: price.]

As a second example, we can study the association between sleeping time and minutes worked per week. For this case, we have more observations. As the plot shows, there is a slight negative association between these two variables: as minutes worked increase, sleeping time per week, on average, decreases.

[Figure: Sleeping time vs. hours worked. x-axis: minutes worked/week; y-axis: sleeping time/week.]
