1st part Statistics - 30001 Bemacc class 12 PDF

Title 1st part Statistics - 30001 Bemacc class 12
Author Francesca Ernani
Course Statistica / Statistics
Institution Università Commerciale Luigi Bocconi
Pages 13



Description

#1

Data are collected on statistical units with respect to given characteristics of interest. Micro data are collected on individuals: consumers, voters, employees… Macro data are collected on groups of units: regions, countries, organizations…

The population (N) is the set of all statistical units of interest. A sample (n) is a subset of the population.

The inferential process consists in drawing conclusions about the entire population from the analysis of a sample drawn from that population. The quality of such conclusions depends first of all on the sample. The sample needs to be representative of the population: it has to be obtained in a way that does not favor some parts of the population over others.

Non-probability sampling: units are drawn according to the judgment of the researcher.
Probability sampling: units are drawn from the population at random. It is the random generation of data that ensures that the sample is representative of the population.

Simple random sampling is a procedure for selecting a sample of size n such that:
1. Each unit is drawn at random from the population
2. Units have the same probability of being drawn from the population
3. Samples of a given size n all have the same probability of being drawn

A parameter is a numerical summary of a characteristic at the level of the population. A statistic is a numerical summary of a characteristic at the level of the sample. The idea is to use sample statistics to make inferences about population parameters.

Descriptive statistics: a collection of techniques that make it possible to describe sample data through sample statistics.
Inferential statistics: a collection of techniques that make use of sample statistics to learn about population parameters.

A variable is the result of a quantitative measurement of a characteristic of interest.

A categorical variable takes on non-numerical values that describe membership of a group or category.
● The values taken by a categorical variable are also referred to as levels or factors.
● Two kinds of categorical variable: nominal and ordinal.
  ❖ Nominal variables cannot be ranked and they only communicate differences in the units of analysis with respect to the characteristics (for any two units we can only say whether they have the same value of the variable or not).
  ❖ Ordinal variables can be ranked and they communicate the relative amount of the characteristic being measured.

A numerical variable takes on numerical values, providing a numerical measure of the size or extent to which a unit has a characteristic.
● Two kinds of numerical variable: discrete and continuous.
  ❖ A numerical variable is discrete when it takes on a finite, or infinite but countable, number of values.
  ❖ A continuous numerical variable can take any value between any two numbers (e.g. What is the price of the houses in the surroundings of your home?).
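The four variable types above could be represented in R roughly as follows. This is only a sketch: the variable names and values are invented for illustration and do not come from the course dataset.

```r
# Invented toy data for five units (hypothetical names and values)
region <- factor(c("North", "South", "South", "Centre", "North"))   # nominal

degree <- factor(c("High school", "Bachelor", "Graduate", "Bachelor", "High school"),
                 levels = c("High school", "Bachelor", "Graduate"),
                 ordered = TRUE)                                     # ordinal

children <- c(0L, 2L, 1L, 3L, 0L)                                    # discrete
price    <- c(250.5, 310.0, 189.9, 420.75, 305.2)                    # continuous

levels(region)           # the categories of the nominal variable
degree[1] < degree[3]    # ordinal levels can be compared
is.ordered(region)       # no ranking is defined for a nominal variable
```

In R terminology the variable itself is stored as a factor and its distinct categories are its levels.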

A variable's type is not given a priori: it is the result of the measurement strategy. The same characteristic or concept can be measured in different ways, by variables of different kinds.
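As an illustration of this point, the same characteristic (time spent watching TV) can be recorded either as a numerical variable or as an ordinal one. The data and the cut points below are arbitrary assumptions chosen only for the example.

```r
# Invented TV hours for six units (numerical measurement)
tv_hours <- c(0, 1, 3, 4, 8, 12)

# Same characteristic recorded as an ordinal variable; the cut points are arbitrary
tv_class <- cut(tv_hours,
                breaks = c(0, 2, 5, Inf),
                labels = c("low", "medium", "high"),
                include.lowest = TRUE,
                ordered_result = TRUE)

table(tv_class)   # frequency distribution of the ordinal version
```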

#2

GSS (General Social Survey) dataset: a full-probability, personal-interview survey designed to monitor changes in both social characteristics and attitudes, conducted in the United States.
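The simplest form of probability sampling, simple random sampling as defined in #1, can be sketched in R with the base function sample(). The population below is simulated and the variable name income is hypothetical; the sketch is not meant to reproduce the GSS sampling design.

```r
# Hypothetical population of N = 1000 units with an invented characteristic
set.seed(1)                       # fixed seed so the draw is reproducible
N <- 1000
population <- data.frame(id = 1:N,
                         income = rnorm(N, mean = 30000, sd = 5000))

# Simple random sample of size n = 50, drawn without replacement
n <- 50
sample_ids <- sample(population$id, size = n, replace = FALSE)
srs <- population[sample_ids, ]

# Parameter (population level) versus statistic (sample level)
mu    <- mean(population$income)  # parameter, usually unknown in practice
x_bar <- mean(srs$income)         # statistic, used to learn about mu
```

Here mu can be computed only because the population is simulated; in a real study only x_bar would be available.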

A frequency distribution table provides the best description of a single variable whenever it takes only a few distinct values.

A frequency distribution table can be graphically represented by means of a pie chart or a bar chart.

Histogram: intervals are reported on the horizontal axis. On each interval a bar is drawn having an area equal to its relative frequency or percentage. The height of the bar (to be read on the vertical axis) is called interval density and is given by the relative frequency divided by the interval width.
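The definition of interval density can be checked numerically. The sketch below uses invented data and intervals of unequal width, and compares the hand-computed density with the one returned by hist().

```r
x <- c(0, 1, 1, 2, 2, 2, 3, 4, 6, 9)        # invented data, n = 10
breaks <- c(0, 2, 4, 10)                      # intervals of unequal width

h <- hist(x, breaks = breaks, plot = FALSE)   # compute the histogram without drawing it

rel_freq <- h$counts / sum(h$counts)          # relative frequency of each interval
width    <- diff(h$breaks)                    # interval widths
density  <- rel_freq / width                  # height of each bar

all.equal(density, h$density)                 # same values as computed by hist()
sum(density * width)                          # bar areas add up to 1
```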

Each column (variable) is accessed by typing dataframe_name$var_name (eg. data_GSS2012_selection$SEX[1:3]).

The frequency distribution (counts) table is obtained as follows:
    tablename = table(variable_name)
The relative frequency distribution table is obtained as follows:
    prop.table(tablename)
Eg:
    tab_DEGREE = table(data_GSS2012_selection$DEGREE)
    ptab_DEGREE = prop.table(tab_DEGREE)
    barplot(tab_DEGREE)     to plot the counts
    barplot(ptab_DEGREE)    to plot the percentages
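Since the GSS file is not included here, the same commands can be tried on a small stand-in data frame. The data frame toy and its values are invented for illustration; only the column name DEGREE mirrors the notes.

```r
# Stand-in for data_GSS2012_selection; the values are invented
toy <- data.frame(
  DEGREE = factor(c("HS", "HS", "Bachelor", "HS", "Graduate", "Bachelor", "HS", "LT HS"),
                  levels = c("LT HS", "HS", "Bachelor", "Graduate"))
)

tab_DEGREE  <- table(toy$DEGREE)        # counts
ptab_DEGREE <- prop.table(tab_DEGREE)   # relative frequencies, summing to 1
round(100 * ptab_DEGREE, 1)             # the same table expressed in percent

barplot(ptab_DEGREE, ylab = "Relative frequency")
```

Note that prop.table() returns proportions between 0 and 1, so they need to be multiplied by 100 to be read as percentages.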


Eg: histogram with 9 equal length intervals is
    hist(data_GSS2012_selection$TVHOURS,9,freq=FALSE)
(R treats a single number of intervals as a suggestion only, so the number of intervals actually drawn may differ from 9.)
Histogram with 11 intervals set at specific locations is
    hist(data_GSS2012_selection$TVHOURS,c(0,1,2,3,4,5,7,8,10,12,15,18))

#3

The measures of central tendency are
● a typical level in the case of a categorical variable
● a typical or average number in the case of a numerical variable
How to measure the central tendency of a variable depends on the variable type.

The mode can be used for all kinds of variables that have a manageable number of distinct values. It is the only measure that can be used for a categorical nominal variable. It is the level or value of a variable that is observed with the highest frequency. It is a typical value in the sense of being the value most often observed in the sample.

The median is computed based on the possibility of ranking the variable's values. It can be used for describing categorical ordinal and numerical variables. It is the central value or observation: it splits the sample into two parts, with half of the cases taking a value below it and half of the cases taking a value above it. (When the number of observations is even, we can take as median either of the two central observations or the average of the two values.)

Cumulative percentage: the percentage (relative frequency) of cases showing a value smaller than or equal to the considered one.
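A small sketch of the mode, the median with an even number of observations, and the cumulative percentages, on invented data. Base R has no built-in function for the statistical mode, so here it is read off the frequency table; the name mode_value is an assumption of this example.

```r
x <- c(1, 3, 3, 3, 4, 6, 7, 9)      # invented data, n = 8 (even)

# Mode: the value with the highest count in the frequency table
tab <- table(x)
mode_value <- as.numeric(names(tab)[which.max(tab)])

# Median with n even: the two central observations and R's choice
sorted <- sort(x)
middle <- sorted[c(length(x) / 2, length(x) / 2 + 1)]
median(x)                            # R returns the average of the two central values

# Cumulative percentages (as proportions)
cum_pct <- cumsum(prop.table(tab))
cum_pct
```

In this example the cumulative percentage is exactly 0.5 at the value 3, one of the two central observations, while median() returns their average.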

Two situations can occur:
● The cumulative percentage does not take value 0.5 at any of the distinct values. We take as median the first value or level at which the cumulative percentage is above 0.5 (50%).
● The cumulative percentage takes value 0.5. We can take as median the value at which the cumulative percentage is equal to 0.5.

Using R:
➔ In the case of a categorical ordinal variable the median is worked out from the frequency distribution table, as the value at which the cumulative percentage reaches 0.5 or is greater than 0.5 for the first time. To find the cumulative percentages: cumsum(ptab_DEGREE).
➔ In the case of a numerical variable the median is computed by applying the command median: median(data_GSS2012_selection$TVHOURS).

The mean can be used only for measuring the central tendency of numerical variables. It is the arithmetic average of all variable values.
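The cumulative-percentage rule for the median of an ordinal variable could be applied as follows. The data are invented, and median_level is a hypothetical name used only in this sketch.

```r
# Invented ordinal data; the levels are listed from lowest to highest
degree <- factor(c("HS", "HS", "Bachelor", "HS", "Graduate", "Bachelor", "HS", "LT HS"),
                 levels = c("LT HS", "HS", "Bachelor", "Graduate"),
                 ordered = TRUE)

ptab_DEGREE <- prop.table(table(degree))   # relative frequencies
cum_pct     <- cumsum(ptab_DEGREE)         # cumulative percentages (as proportions)

# First level at which the cumulative percentage reaches or exceeds 0.5
median_level <- names(cum_pct)[which(cum_pct >= 0.5)[1]]
median_level
```

The rule relies on the levels of the table being in their natural order, which is why the factor is created with explicit levels.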

4 We call deviation the difference between each variable value and the mean. The deviation is positive whenever the value is above the mean and negative when the value falls below the mean. The deviation can be seen as measure of the distance between the value and the mean. The sum of all deviations from the mean is equal to 0. The sum of total distances above the mean is equal to the sum of the total distances below the mean. The mean is the balancing point of the observations. If we were to place identical weights on a line at the points representing where the observations occur, the line would balance by placing a fulcrum at the mean.

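When a frequency distribution table is provided, the mean is given by the weighted average of the variable values with weights given by the percentages:

    \bar{x} = \sum_{j=1}^{k} x_j \, p_j

where x_1, …, x_k are the distinct values of the variable and p_j is the relative frequency of x_j (the percentage expressed as a proportion).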
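The weighted-average form of the mean can be checked against mean() on invented data. The names values and weighted_mean are assumptions of this sketch.

```r
x <- c(1, 3, 3, 3, 4, 6, 7, 9)              # invented data

ptab   <- prop.table(table(x))               # relative frequency of each distinct value
values <- as.numeric(names(ptab))            # the distinct values themselves

weighted_mean <- sum(values * as.vector(ptab))
all.equal(weighted_mean, mean(x))            # both computations agree
```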

Using R, mean(data_GSS2012_selection$TVHOURS). In R, missing values are represented by the symbol NA (not available). Eg: x ...

