Stat1 - First chapter PDF

Title Stat1 - First chapter
Author Nasir Jones
Course Statistics
Institution Università degli Studi di Trieste
Statistics (a.y. 2017/2018)

Matilde Trevisani, matildet@deams.units.it
home page, teaching page
II semester - 2018

Contents

Lecture 1

General information (Part)
1 Synopsis

Introduction (Part)
1 What is Statistics?
1.1 Some definitions
1.2 An example of Textual Data Analysis
1.3 Data Science
2 Data Collection
2.1 Variables

Displaying univariate distributions (Part)
1 Displaying categorical variables
1.1 Frequency tables and diagrams
1.2 Bar and Pie Charts

1 SYNOPSIS

2

Lecture 1

General information (Part)
1 Synopsis

Logistics
Lecture/Lab Times and Location
• Monday 12-15
• Thursday (Lab) 15-17
• Friday 9-12

Room
“Aula Mappe Antiche”, Via Tigor 22

Classes
Lectures (60 hours) alternate with practical sessions (20 hours).

Instructors
1. Prof. dr. Matilde Trevisani, matildet@deams.units.it
   home page, teaching page (enter the instructor’s name in the appropriate box to find his/her courses)
   DEAMS, Via Tigor 22, 2nd floor, room 205
   60h lectures.
2. Dr. Elvira Pelle, [email protected]
   home page
   20h practicals.

Readings
Textbook
For business statistics courses taught in Economics and Business Schools:
Newbold P., Carlson W.L. and Thorne B., Statistics for Business and Economics, Global Edition, 8th Edition (2013), Pearson / Prentice Hall
http://www.pearsonhighered.com/educator/product/Statistics-for-Business-and-Economics-8E/9780132745659.page

Matilde Trevisani

Statistics

a.y. 2017/2018 II s.


Other reference books:
• David S. Moore, William I. Notz, Michael A. Fligner, The Basic Practice of Statistics, 6th Edition (2013), W.H. Freeman Publishers (there is also a newer edition, 7th, 2015)
• Freedman, Pisani, and Purves, Statistics, 4th Edition (2007), W. W. Norton & Co., Inc.

Course lectures will be posted on the class website (federated Moodle) on the day of the lecture.

Lecture and exam calendar
Lecture calendar A.A. 2017/18
• February 19 - March 29
  Easter holidays: March 30 - April 3
• April 4 - May 19
Exam calendar A.A. 2017/18
• Summer: May 28 - July 6 (3 tests)
• Autumn: September 3 - 21 (1 test)

Exam rules
Test
The exam is a written test in two parts:
I part: a set of basic questions. Answering the majority of them correctly is enough to pass the exam (with a mark from 18 to 21) and is a precondition for the second part to be marked.
II part: 2 exercises with multiple questions, which give the chance of obtaining a higher grade.
Rules
• The Mathematics course is a prerequisite for the Statistics course (in case of non-compliance the Statistics exam will be annulled).
• You have to enroll through the Esse3 system. Enrollment closes a week before the examination date.
• During an exam session you can withdraw. If you hand in your test, it will be marked. If you pass, you cannot refuse the mark.


Syllabus
Course overview
The course provides a basic grounding in the theory and concepts of statistical reasoning, both descriptive and inferential. The first part starts by discussing techniques for exploratory data analysis, which consists of organizing, displaying and summarizing data. This will include the formulation of appropriate scientific questions. It then presents the basics of probability theory and random variables, which help in understanding the techniques of statistical inference. The second part covers the fundamentals of statistical inference, including the primary tools of estimation and hypothesis testing. Students are also introduced to the descriptive and inferential aspects of simple linear regression models.

Course content summary
The course is structured into three parts:
I Exploratory data analysis
- Data organization and investigation through frequency distributions, graphical displays, and measures of location, spread and shape.
- The study of the relationship between two variables using two-way frequency tables, scatterplots, and measures of dependence. Regression line: introduction to (simple) linear regression models.
II Elements of probability
- Events, axioms of probability, the addition and multiplication rules and associated theorems. Discrete and continuous variables.
- The main discrete and continuous probability models. Random variables and independence. Transformations and sums of random variables.
III Elements of statistical inference
- Introduction to inferential statistics. Sampling distributions. Point and interval estimation and parametric hypothesis testing, mainly focusing on inferences involving the population mean, the population variance and the proportion of successes. Test of independence in two-way tables.
- Statistical inference applied to the simple linear regression model.

Syllabus (I part)
Exploratory data analysis and elements of probability
Introduction: statistics, scientific method and science; from the world of information to knowledge; descriptive statistics and inferential statistics.
Data collection: variables, units and population; census and sample surveys; categorical and quantitative variables.
Displaying univariate distributions: frequency tables; graphical displays (pie chart, bar graph, histogram, stem-and-leaf display, quantile plot, ogive).


Summarizing univariate distributions: measures of location (measures of central tendency, quantiles), spread (range, interquartile range, standard deviation) and shape (symmetry, kurtosis); boxplots; moments.
Analyzing the relationship between two variables: two-way contingency tables (joint, conditional, and marginal frequencies), and measures of association (X², contingency index, relative risk and odds ratio); grouped data and mean difference across groups; scatterplot, covariance and linear correlation, regression line (computation, interpretation, properties, and prediction).
Elements of probability: events; probability (notes on different definitions), axioms, the multiplication and addition rules; conditional probability, independent events; the total probability theorem, Bayes’ theorem; univariate discrete and continuous random variables; density and distribution functions; joint, marginal and conditional probability distributions; expected values and variance; Chebyshev’s inequality.

You can also find the course outline
• at the Moodle page of the course
• at the university teaching page.


1 WHAT IS STATISTICS?

6

Introduction (Part)
1 What is Statistics?
1.1 Some definitions

What is Statistics? Some definitions
• Statistics is an empirical science
  – from Greek “empeirìa”, from Latin “experientia”, “experiri”
• state: State (country) / state (condition, Latin “status”)
  – It is the research method of collective (population) phenomena
• Statistics ≠ statistics (summaries)!

Definition from one of the textbooks (Ch. 1 of Moore’s book)
Statistics is a tool to help us process, summarize, analyze, and interpret data for the purpose of making better decisions in an uncertain environment. Basically, an understanding of statistics will permit us to make sense of all the data.

Definition from the International Society for Bayesian Analysis (ISBA)
Statistics is primarily concerned with the analysis of data,
• either to assist in the appreciation of some underlying mechanism,
• or to reach effective decisions.
In both cases, some uncertainty resides in the situation and the statistician’s tasks are both to reduce this uncertainty and to explain it clearly. Problems of this type occur throughout all the physical, social and other sciences. Uncertainty is described by probability, which is a sensible language for a logic that deals with uncertainty, and not just with the extremes of truth and falsity. Statistics is the logic of contemporary society. It is "common sense reduced to calculation".

Discipline, scientific method, science, ...
In other (more technical) words ...
What is the science of statistics?
A systematic and objective methodology for effectively using data to answer research questions and test hypotheses in the presence of variation by:
• Collecting data
• Summarizing data


• Drawing conclusions from data

From Straf M. (2003), “Statistics: The Next Generation”, JASA, Presidential Address
Many of our leaders have sought to define statistics, leaving us a legacy of countless definitions (see Bartholomew 1995). From the variety of perspectives reflected in these definitions, one might conclude that statisticians do not even agree on what statistics is. But we do generally agree on the elements that it comprises: data; variability; uncertainty; sources of error; conceptualizing and quantifying phenomena; empirical inquiry through experiments, surveys, observational, and other studies; extraction and summary of information; inferences; and communication of results.
[...] my personal definition of statistics through what I believe are its very purposes: To increase our understanding, to promote human welfare, and to improve our quality of life and well-being by advancing the discovery and effective use of knowledge from data.
This statement defines statistics not as a body of methods nor a collection of data, but rather as an activity. Statistics is the generation and effective use of knowledge from data: data with all their uncertainty, fallibility, and variability.

1.2 An example of Textual Data Analysis

The history of statistics through analysis of keywords in an early scientific journal [1][2]
We explored the opportunity of reading the History of Statistics by means of the temporal evolution of keywords included in the papers published by the American Statistical Association (ASA)’s journals (Figure 1). In this study we considered the titles of papers in the period 1888-2012.

Data collection: from words to keywords
1. Text Harvesting
2. Parsing words
3. Stemming
4. Identifying stem-segments
5. Tagging keywords
6. Thresholding

See the following figures on correspondence analysis (Figures 2 and 3) and on some cluster examples (Figures 4-13).
Statistics was and is: (see Figure 13)
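The "from words to keywords" steps can be illustrated with a minimal sketch. It is not the procedure used in [1][2]: it covers only parsing, stemming and thresholding, the suffix-stripping function `crude_stem` is a hypothetical stand-in for a real stemmer, and the three titles are invented for the example.

```python
import re
from collections import Counter

def crude_stem(word):
    # Hypothetical stand-in for a real stemmer: strip a few common suffixes,
    # keeping at least four characters of the word.
    for suffix in ("ations", "ation", "ical", "ics", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def keyword_counts(titles, min_count=2):
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z]+", title.lower())   # parsing words
        counts.update(crude_stem(w) for w in words)    # stemming
    # thresholding: keep only stems that occur at least min_count times
    return {stem: c for stem, c in counts.items() if c >= min_count}

# Invented titles, for illustration only.
titles = [
    "Statistics of population growth",
    "Statistical methods for sampling",
    "Sampling methods in population studies",
]

keywords = keyword_counts(titles)
for stem, count in sorted(keywords.items()):
    print(stem, count)
```

In this toy corpus "statistics" and "statistical" collapse onto the same stem, which is the point of stemming before counting; rare stems are then dropped by the threshold.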


Figure 1: Abstract from data of the corpus of titles of papers published by JASA from 1888 to 2012

1.3 Data Science

Data Science
Working with data in a scientific way that will produce new and reproducible insight.

Data deluge
From The Economist, http://www.economist.com/node/15579717.
Over the last several years
• data have become much, much cheaper to collect;
• they are much easier to store;
• and there are so many free computing tools available that you can actually do something with the data deluge that is flooding all areas of science and business.

Big Data
From McKinsey Global Institute, https://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation.
The other term that comes into play now is big data, which is a sort of new frontier: we have data in areas where we did not use to have any. For example, now
• we have access to GPS coordinates from cars all over the world
• it is possible to sequence everybody’s genome.

Knowledge mapping

Uncertainty, variation, ...
The fact that variation exists is the reason why you are taking this class.
• Statistics aids in finding truth.


Figure 2: Correspondence analysis on the corpus of titles of papers published by JASA from 1888 to 2012: projection of years on factorial plane

Figure 3: Correspondence analysis on the corpus of titles of papers published by JASA from 1888 to 2012: projection of keywords on factorial plane

Figure 4: Curves of frequency over time of some words from the corpus of titles of papers published by JASA from 1888 to 2012


Figure 5: Curve of frequency over time of the word statistics from the corpus of titles of papers published by JASA from 1888 to 2012

Figure 6: Curves of frequency over time of some of the most-frequent keywords from the corpus of titles, papers of JASA 1888-2012


Figure 7: Curves of frequency over time of “classical” keywords of Statistics from the corpus of titles, papers of JASA 1888-2012

Figure 8: Curves of some of the most-frequent keywords in the “Ancient History” of Statistics when it dealt with demography and population studies


Figure 9: Curves of some of the most-frequent keywords in the “Ancient History” of Statistics reflecting its role of public and social service

Figure 10: Curves of some of the frequent keywords in “Middle Ages” of Statistics reflecting its role of service to economics


Figure 11: Curves of some of the frequent keywords mostly during the two Wars reflecting its role of service to politics

Figure 12: Curves of some of the keywords common to the “Modern and Contemporary History” of Statistics by now established as an autonomous discipline


(a) Demography and population studies
(b) Public Statistics
(c) Economic Statistics
(d) Classic Statistics
(e) The Basic Statistics Vocabulary
(f) Modern and Contemporary Statistics

Figure 13: The history of statistics through analysis of keywords

Figure 14: The data deluge

• Statistics provides methods for measuring variation, modeling variation and, if needed, identifying sources of variation.
• Because of variation, statistical conclusions about hypotheses are made using probability.

Questions from Everyday Life
Business: Will a new marketing strategy be profitable?
Industry: Will a product’s life exceed the warranty period?
Medicine: Will a low carbohydrate diet reduce blood pressure?
Education: Will technology improve learning?
Government: Will a change in interest rates affect inflation?

Steps for Statistical Problem Solving


Figure 15: Big data

Figure 16: A knowledge map

Question Formulation: Articulate a research question or a hypothesis to be tested.
Data Production: Collect defensible and relevant data.
Data Summarization: Graph data and compute numerical summaries.
Statistical Inference: Draw conclusions about how results apply in a broader context. (Typically, we use information from sample data to describe the population.)

Figure 17 displays all these steps to statistical inference.

Descriptive vs Inferential Stats
Descriptive: Organizing, simplifying, giving properties of data
• (What we shall do in the I part.)
Inferential: Drawing conclusions about a population based on properties of a sample
• This can be very misleading if done incorrectly
• (The II part will focus on inferential statistics; to do that you will use what you learned about descriptive statistics.)
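The difference between the descriptive and the inferential step can be sketched in a few lines. This is only an illustration under stated assumptions: the "population" of product lifetimes is simulated with invented parameters, and the interval uses the usual large-sample normal approximation (multiplier 1.96), a tool that belongs to the inferential part of the course.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 product lifetimes in months (invented numbers).
population = [random.gauss(36, 6) for _ in range(10_000)]

# Data production: draw a simple random sample of n = 100 units.
sample = random.sample(population, 100)

# Descriptive step: summarize the sample itself.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Inferential step: use the sample to say something about the population mean.
# Approximate 95% confidence interval (normal approximation).
std_error = sample_sd / len(sample) ** 0.5
ci = (sample_mean - 1.96 * std_error, sample_mean + 1.96 * std_error)

print(f"sample mean = {sample_mean:.2f}, sample sd = {sample_sd:.2f}")
print(f"approx. 95% CI for the population mean: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The sample mean and standard deviation describe only the 100 observed units; the interval is a statement about the whole population and is only as trustworthy as the sampling scheme behind it.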


2 DATA COLLECTION

16

Figure 17: Statistical inference

Exploratory vs Confirmatory Analysis
Exploratory: data mining, exploration, investigation, without having well-established hypotheses. Let the data speak for themselves.
Confirmatory: model fitting to data, hypothesis testing, model validation, model determination.

References
[1] Trevisani, M., and Tuzzi, A. A portrait of JASA: the History of Statistics through analysis of keyword counts in an early scientific journal. Quality and Quantity 49, 3 (2015), 1287–1304.
[2] Trevisani, M., and Tuzzi, A. Learning the evolution of disciplines from scientific literature: A functional clustering approach to normalized keyword count trajectories. Knowledge-Based Systems (2018).

Thoughts on Statistical Thinking
From Andrew Gelman’s blog
[...] the idea of taking a model seriously–not as a belief or set of betting probabilities, but as a scientific model–a necessarily oversimplified stylized description of reality that can be rigorously tested and eventually refuted, with said refutation providing useful clues into additional information that had not so far been included.

2 Data Collection

Statistical concepts
Population: The entire set of all elements we are interested in.
• Population size N can be very large or even infinite.


Sample: Collection of some of the elements obtained from a population.
• Sample size n is [...]

1 DISPLAYING CATEGORICAL VARIABLES

gender  age    educ   nation  workst   sector  clife  fullt  open   mobil
[...]   [...]  [...]  ital    spec     health  6m-2y  part   open   mobil
[...]   [...]  [...]  ital    empl     trade   1-6m   part   fixed  nomob
[...]   [...]  [...]  ital    spec     build   >2y    part   fixed  nomob
[...]   [...]  [...]  ital    manag    broker  6m-2y  full   open   mobil
[...]   [...]  [...]  ital    nonqual  trade   1-6m   full   fixed  nomob
[...]   [...]  [...]  ital    spec     broker  1-6m   full   open   mobil
[...]   [...]  med    ital    work     trade   6m-2y  part   open   mobil
[...]   [...]  BC/>   ital    spec     build   >2y    full   open   mobil
[...]   [...]  med    neocom  nonqual  broker  1-6m   full   fixed  nomob
[...]   [...]  med    ital    empl     trade   6m-2y  part   open   nomob
[...]   [...]  no/el  neocom  nonqual  health  1-6m   full   open   mobil
M       25-34  high   neocom  spec     trade   6m-2y  full   fixed  nomob
M       35-44  no/el  neocom  work     manuf   6m-2y  full   open   mobil
M       25-34  BC/>   neocom  nonqual  health  1-6m   part   fixed  nomob
...

Coded data

gender  age  educ  nation  workst  sector  clife  fullt  open  mobil
0       0    2     0       1       8       2      0      0     1
0       1    1     0       3       5       1      0      1     0
0       1    2     0       1       4       3      0      1     0
1       2    2     0       0       6       2      1      0     1
0       1    2     0       5       5       1      1      1     1
0       1    3     0       1       6       1      1      0     1
0       1    1     0       4       5       2      0      0     1
0       1    3     0       1       4       3      1      0     1
1       2    1     3       5       6       1...
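A frequency table for a categorical variable, as announced in the section title, can be sketched as follows. The example uses the eleven `sector` labels readable in the data extract above; the function name `frequency_table` and the output layout are assumptions made for this illustration, not part of the course material.

```python
from collections import Counter

# Sector labels of the first eleven units, as they appear in the data extract.
sector = ["health", "trade", "build", "broker", "trade", "broker",
          "trade", "build", "broker", "trade", "health"]

def frequency_table(values):
    """Return (category, absolute frequency, relative frequency) rows,
    sorted by decreasing absolute frequency."""
    counts = Counter(values)
    n = len(values)
    return [(category, k, k / n) for category, k in counts.most_common()]

table = frequency_table(sector)

print(f"{'sector':<8}{'count':>6}{'rel.freq':>10}")
for category, count, rel in table:
    print(f"{category:<8}{count:>6}{rel:>10.3f}")
```

The absolute frequencies sum to the number of units and the relative frequencies sum to 1; the same rows are what a bar chart or a pie chart of the variable would display.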

