Lecture 1 + 2 notes PDF

Title Lecture 1 + 2 notes
Author John Doe
Course Introductory to Statistics
Institution The University of British Columbia
Pages 13



Description

COMMERCE 291 – Lecture Notes 2021 – © Jonathan Berkowitz Not to be copied, used, or revised without explicit written permission from the copyright owner.

Summary of Lectures 1 and 2

Welcome to the year 2021

Pattern recognition is a hallmark of intelligence. Statistics is all about pattern recognition. Here’s a demonstration. Consider a simple Letter-Number Coding scheme: A=1, B=2, ..., Z=26

B E R K O W I T Z ➔ 2 + 5 + 18 + 11 + 15 + 23 + 9 + 20 + 26 = 129

Rearrange the digits and you get 291, the number of our course.

Now, this is the year 2021. Apply the same Letter-Number Code to the year spelled out as TWENTY TWENTY-ONE. The sum of the values = 248. According to the Hitchhiker’s Guide to the Galaxy, by Douglas Adams, the “Answer to the Ultimate Question of Life, the Universe and Everything,” calculated by an enormous supercomputer named Deep Thought, is 42. Look it up in Wikipedia, or better yet, read the books. Add 248 and 42 to get 290. That’s the number of last term’s course. We need to add 1 to get to 291. You are the ONE needed to help our course find the answer to life, the universe and everything in the year 2021!

Every number is interesting, in some way. And that means every data set is interesting, so Statistics is interesting, as you’ll find out this term! Words and letters are interesting too. Rearrange the letters of STARTS INTO and you can spell INTRO STATS. So, let’s start into Intro Stats!
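The letter-number arithmetic above is easy to check by machine. The following is a minimal sketch in Python; the function name `letter_number_sum` is made up for this illustration and is not part of the course material. It ignores spaces and hyphens, which is the assumption needed for TWENTY TWENTY-ONE to come out at 248.

```python
def letter_number_sum(text: str) -> int:
    """Sum the code A=1, B=2, ..., Z=26 over the letters in text.

    Anything that is not a letter A-Z (spaces, hyphens) is skipped.
    """
    return sum(ord(ch) - ord("A") + 1 for ch in text.upper() if "A" <= ch <= "Z")


name_total = letter_number_sum("BERKOWITZ")
year_total = letter_number_sum("TWENTY TWENTY-ONE")

print(name_total)        # 129
print(year_total)        # 248
print(year_total + 42)   # 290

# An anagram uses exactly the same letters, so the sorted letters must match.
print(sorted("STARTSINTO") == sorted("INTROSTATS"))  # True
```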

What Is Statistics?

“Statistics… the most important science in the whole world; for upon it depends the practical application of every other science and of every art; the one science essential to all political and social administration, all education, all organization based on experience, for it only gives results of our experience.” {Guess the source.}

The statement recognizes that you can’t figure out what might happen in an individual case until you’ve analyzed what happens in the aggregate. Individuals are not predictable, but crowds are. That’s one of the cornerstones of medicine, management, politics, economics (particularly behavioural economics), and so on.


That’s what Statistics is about… looking for patterns in the aggregate that can then be applied to the individual. That’s how Netflix can make viewing recommendations, or Amazon can suggest related products for you to buy. All aspects of the COVID-19 pandemic depended on statistics. Terms like “flattening the curve” and “exponential growth” are based on statistical concepts. But viruses are not the only problem for the world to face. There’s also climate change, poverty, hunger and food security, disease, war, inequality, and existential risks. Check out the organization Our World in Data, at their website www.ourworldindata.org. It all comes down to data and data analysis, that is, Statistics.

A Thumbnail History and Definition of the Subject

The word “statistics” comes from the Latin word for the “state”, because the first data collected were for the purposes of the state: taxes and military service. Birth and mortality rates appeared in England in the 17th century, about the same time French mathematicians were laying the groundwork for probability by studying gambling problems. Applications to studies of heredity, agriculture, and psychology were developed by the great English scientists, Galton, Pearson, and Fisher, who gave us many of the techniques we use today: design of experiments, randomization, hypothesis testing, regression, and analysis of variance. With such a diversity of origin, it is not surprising that the word “statistics” means different things to different people.

Small-s statistics (i.e., what are statistics?)
• numerical or quantifiable facts
• computations based on these facts (e.g., average or percentage)
• measurements, counts, ranks
• a synonym for “data”

Large-S Statistics (i.e., what is Statistics?)
• a set of methods for collecting, organizing, summarizing, presenting, and analyzing numerical facts
• generalizations or inferences about the whole based on partial knowledge rather than complete knowledge
• decision-making in the face of uncertainty.

Here are five of my favourite statements about Statistics.
1. Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write. ~ H.G. Wells
2. Knowledge of statistical methods is not only essential for those who present statistical arguments; it is also needed by those on the receiving end. ~ R.G.D. Allen


3. We live not in a time of information explosion but in a time of data inundation. ~ William G. Hunter
4. Statistics should not be used the way a drunk uses a lamppost, that is, for support rather than for illumination. ~ Andrew Lang
5. “Data, data, data!” he cried impatiently. “I cannot make bricks without clay.” ~ Sherlock Holmes

“Statistics is a set of ideas and techniques that enable the user to collect data efficiently and then to discover what the data mean. Statistics is an applied discipline. [It] is not a purely deductive discipline. It involves art as well as science, individual judgment as well as careful, logical deductions. Statistics is used as an aid to decision-making. It is used to control manufacturing processes and to measure the success of those processes. It is used to calculate premiums on insurance policies. It is used to identify criminals. In the health sciences, finding a new statistical relationship between two or more variables is considered ample grounds to write and publish yet another paper. Statistics is used to formulate economic policy and to make decisions about trading stocks and bonds. It would be difficult to find a branch of science, a medium to big business, or a governmental department that does not collect, analyze, and use statistics. It is an essential science.” ~ John Tabak

Why is Statistics Important to You?

Statistical analysis plays an important role in virtually all aspects of business. Here are some business-related questions that statistics can help answer. (Source: Sharpe, De Veaux, Velleman, Berkowitz)

• Do university students from different parts of the world perceive business ethics differently?
• What is the effect of advertising on sales?
• Do aggressive "high-growth" mutual funds really have higher returns than more conservative funds?
• Is there a seasonal cycle in your firm's revenues and profits?
• What is the relationship between shelf location and cereal sales?
• How reliable are the quarterly forecasts for your firm?
• Are there common characteristics about your customers and why they choose your products? Are they the same characteristics among people who aren't your customers?

The underlying concept is variation or uncertainty. The world is full of variation, and Statistics is used to distinguish real differences from natural variation. The essence of statistics is the ability to understand variation. You learned some aspects of uncertainty in your previous course on Quantitative Decision Making. Now you’ll learn other aspects that incorporate the uncertainty of data.


In fact, statisticians say that the Science of Statistics is also the Science of Uncertainty! Statistics is also about the communication of information. Communication is a basic function of analytic thinking.

To summarize, a knowledge of Statistics helps us make better decisions in business and in life, answer important social questions, and evaluate the effectiveness of policies, programs, procedures, and products. And understanding the subject helps you to avoid falling victim to those who misuse statistical tools for evil purposes.

Basic concepts in statistical literacy

“Our society would be unimaginably different if the average person truly understood basic mathematical concepts.” ~ Douglas Hofstadter, cognitive scientist, author of Gödel, Escher, Bach and coiner of the term “innumeracy”

#1. The size of numbers
How big is a million? How big is a billion? How big is a trillion? How many millions in a billion? ... in a trillion? How small is one in a million? ... one in a billion? ... one in a trillion?

#2. What is an average?
It is not as simple as you think. Is there more than one type of average? How do you compute an average of rates?

#3. What is a percentage?
It is also not as simple as you think. Percentages are fractions. The denominator or base is important and is needed for context.

#4. What is randomness?
Is it just the absence of pattern? How is it related to uncertainty?

#5. Can you make a sensible estimate?
The ability to make rough common-sense estimates starting from just a few basic facts will help you to better understand the world around you and better recognize numerical, political, and scientific nonsense. See “Extra Material” following, for examples.


Extra Material: Estimation

Many top companies use estimation questions in job interviews to judge the intelligence and flexibility of their applicants. [They are often referred to as Fermi Questions.] For example, how many circus clowns can you fit in a Honda Civic or a Mini?

Hint: It is often easier to establish lower and upper bounds for a quantity than to estimate it directly. If we are trying to estimate how many circus clowns can fit into a Honda Civic, we know the answer must be more than one and less than 100. We could average the upper and lower bounds and use 50 for our estimate. This is not our best choice because it is a factor of 50 greater than our lower bound and only a factor of two lower than our upper bound. We want our estimate to be the same factor away from both our upper and lower bounds, so use the geometric mean: multiply the two bounds and take the square root. (Or average the exponents of the powers of 10; if the sum of the exponents is odd, decrease the sum by one and multiply the final answer by three.) For the clown case, 1 x 100 = 100; the square root is 10.

More illustrations of estimation:

1) Volume of the Grand Canyon: about 4.17 trillion m³, or 5.45 trillion cu. yd. Build an apartment-sized box: 10 m long x 10 m wide x 5 m high = 500 m³. Then 4.17 trillion divided by 500 = 8.34 billion = # of boxes that would fit in the Grand Canyon. That exceeds the population of the earth, which is about 7.75 billion people (as of the end of 2019)!

2) How much would a million U.S. $1 bills weigh?
Answer: 500 sheets of copier paper weigh about 4 lbs or about 2 kg. Approximately 5 bills fit on one sheet, so 5 x 500 = 2500 bills weigh 4 lbs. A million is 2500 x 400, so a million $1 bills weigh 400 x 4 lbs = 1600 lbs, or 400 x 2 kg = 800 kg.

3) Fold a piece of paper in half 50 times. How thick is the pile?
Answer: 50 folds means 2⁵⁰ = (2¹⁰)⁵ = 1024⁵ layers, which is about 1000⁵ = 10¹⁵ = 1 quadrillion layers. 10 folds is about 4 inches, 25 folds is about 2 miles, 50 folds is about 64 million miles, and 51 folds is greater than the distance from the Earth to the Sun.

4) What is the volume of human blood in the world?
Answer: The average adult human has about 5 litres of blood; children have considerably less. So, let’s say there are 4 litres of blood per person. The world population was about 7.75 billion at the end of 2019, so let’s round up to 8 billion. Then 8 billion people x 4 litres = 32 x 10⁹ litres of blood in the world. One litre is 1000 cm³ or 0.001 m³. So there are 32 x 10⁶ m³ of blood. Take the cube root to get a cube about 320 metres per side, or 0.032 (about 1/31) of a cubic kilometre.


Central Park in New York is 843 acres or about 3.41 km². If walls were put around it, then all the blood in the world would cover the park to a depth of something under 10 metres or about 30 feet. The Dead Sea is about 605 km² (in 2019, and it’s receding). Adding the blood would increase the depth by about 5 cm or about 2 inches (and would make it the Red Sea instead of the Dead Sea).

6) On average, how many people are airborne over the U.S. at any given moment? (Source: Guesstimation, by Lawrence Weinstein and John Adams)
Answer: There are two basic ideas here.
1. The fraction of time the average person spends flying equals the average fraction of people that are airborne at any instant. Thus if you spend 10% of your time flying then on average 10% of the population is airborne at any given time.
2. We can use our own experience to estimate the fraction of time an average person spends in the air (or doing anything for that matter!).

In other words, Number flying now / US population = Time spent flying / 1 year.

US population: 3 x 10⁸ Americans.

How many plane flights does each American take per year? Most probably travel once a year (two flights) on vacation or business and a small fraction (say 10%) travel much more than that. This means that the number of flights per person is between two and four, so we’ll use three. The typical flight will take between one and six hours (not counting time spent parking, checking in, checking baggage, going through security, eating, etc.) so we will estimate three flights per year at three hours per flight, or nine hours per year in flight.

Insert the numbers we know, rounding one year (365 days x 24 hr/day) up to 400 days x 25 hr/day for easy arithmetic:
Number flying now / 3 x 10⁸ people = 9 hr / (400 days x 25 hr/day)
400 days x 25 hr/day = 10,000 hr = 10⁴ hr
Number flying now = 3 x 10⁸ people x (9 hr / 10⁴ hr) = 27 x 10⁴ = 270,000

So, there are about 300,000 people airborne over the US at this moment. [That would mean about 30,000 Canadians airborne over Canada.]
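The estimates above can be replayed in a few lines of Python. This is only a sketch that reuses the rough inputs quoted in the notes; the function and variable names are invented for this illustration, and the outputs are order-of-magnitude figures, not measurements.

```python
import math


def geometric_mean_estimate(lower: float, upper: float) -> float:
    """Estimate that is the same factor away from both bounds."""
    return math.sqrt(lower * upper)


# Clowns in a Honda Civic: bounds of 1 and 100 give 10.
clowns = geometric_mean_estimate(1, 100)

# Blood in the world: 8 billion people x 4 litres, with 1 litre = 0.001 m^3.
blood_m3 = 8e9 * 4 * 0.001
cube_side_m = blood_m3 ** (1 / 3)               # side of a cube holding it all
central_park_depth_m = blood_m3 / 3.41e6        # Central Park is about 3.41 km^2
dead_sea_rise_cm = blood_m3 / 605e6 * 100       # Dead Sea is about 605 km^2

# People airborne over the U.S.: population x (hours flying / hours in a year).
us_population = 3e8
hours_flying_per_year = 3 * 3                   # 3 flights x 3 hours each
hours_per_year_rounded = 400 * 25               # deliberate rounding of 365 x 24
airborne = us_population * hours_flying_per_year / hours_per_year_rounded

print(f"clowns: {clowns:.0f}")
print(f"cube side: {cube_side_m:.0f} m")
print(f"Central Park depth: {central_park_depth_m:.1f} m")
print(f"Dead Sea rise: {dead_sea_rise_cm:.1f} cm")
print(f"airborne: {airborne:,.0f}")
```

Using the exact 365 x 24 = 8,760 hours instead of the rounded 10,000 would raise the airborne figure slightly, which does not change the order-of-magnitude conclusion.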

End of Extra Material


Chapter 1: Statistics, Data, & Decisions

Data – Where it All Begins

An often-used synonym for statistical analysis is “data analysis.” Let’s begin by thinking about data. Note that the word ‘data’ is plural (the singular is ‘datum’); it comes from the Latin meaning ‘to give’; so in the current sense, data are the information given to us to analyze and interpret. To be grammatically correct, say “data are” not “data is.” How do you pronounce “data”? With a long A or a short A?

Statistics: The science of using simple words for complicated concepts.

Terminology:
• Variable – a characteristic recorded about an individual (usually a column in a spreadsheet)
• Value – a specific observation of a particular item or process (i.e., one entry in a column in a spreadsheet)
• Data (or small-s statistics) – the collection of values of the variables
• Observations – another word for data
(For example: the height of students in a class is a variable. Once you measure each student, each student’s height is a value. Then the actual values of height for each student are the data.)
• Data table – an arrangement of data in rows and columns; a spreadsheet
• Record – a row in a spreadsheet
• Case – an individual in a spreadsheet for which there are data; often there is one record per case, but multiple records per case are possible in large data sets.
• Database – a complex data structure possibly involving multiple spreadsheets all linked so that information across them can be combined.
• Respondent – an individual who answers a survey
• Subject – a human participant in an experiment
• Experimental Unit – a "non-human" (i.e., animal, plant, inanimate object) participant in an experiment.
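To make the terminology concrete, here is a small sketch of a data table in Python. The student IDs and heights are hypothetical values invented for illustration, and the names `data_table` and `height_cm` are assumptions of this example, not part of the course material.

```python
# Hypothetical class data, invented purely to illustrate the terminology.
# Each dictionary is one record (a row); each student is one case.
data_table = [
    {"student_id": "S001", "height_cm": 172.0},
    {"student_id": "S002", "height_cm": 165.5},
    {"student_id": "S003", "height_cm": 180.2},
]

# The variable is the characteristic being recorded (a column).
variable = "height_cm"

# The values are the individual entries in that column;
# taken together, they are the data (the observations).
values = [record[variable] for record in data_table]

print(len(data_table), "records")
print(variable, "values:", values)
```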


Two Types of Data (or Variables)

We begin by thinking about types of data (or variables). The simplest classification is a dichotomy. Data (or variables) are either categorical or quantitative (or measurement). The first principle of data analysis is to understand which type of data you have. But not only is this the first principle, it is undoubtedly the most important one. If you do not know what type of data you have, you cannot choose an appropriate analysis!

Categorical data (also called discrete, or count data)
Data are categorical if observations can be put into distinct “bins”. In other words, there are a limited number of possible values that the variable can take. There are three subtypes of categorical data:
• Binary: the most basic categorical data; there are only two possible values. Some examples: Yes/No, Defective/Non-defective, Survive/Die, Accept/Reject, 0/1. Others: Pass/Fail, Live On/Off Campus, Prof/Student, Faculty/Staff, Win/Loss, International/Domestic Student, Buy/Not Buy, PC/Mac, Vote For or Against
• Nominal: extension of binary to more than two categories, but the categories are unordered. Nominal means “named.” Some examples: ethnicity, BCOM major, industry sector, marital status.
• Ordinal: extension of binary to more than two categories, but the categories are ordered. Ordinal means “ordered.” Some examples: letter grade (A, B, C, D, F); gold, silver, bronze; 3-point scale of change (better, the same, worse); highest level of education; ranking of top-performing stocks or businesses. Typical 3-point, 5-point, or 7-point response scales of agreement, satisfaction, etc. are ordinal in nature.

Quantitative data (also called measurement, continuous, or interval)
One easy way to identify quantitative data is that they have measurement units. However, sometimes the units are imaginary. Quantitative data are also characterized by the involvement of some kind of measurement process such as a measuring instrument or questionnaire.
And, there are a large number of possible values with little repetition of each value. For example: age (in years), height and weight, salary, percentage grades, return on investment. Sometimes the type of data is clear and obvious, sometimes it is not. Comments on why determining the type of data is not always straightforward: #1. Quantitative variables can be created by summing ordinal variables. #2. Some variables can be quantitative in theory, but categorical in practice. For example, a survey question asks about household income, but only gives broad categories, since respondents would not and could not give an actual dollar figure.


#3. Some variables can be expressed as more than one type of data. For example, age in years is a quantitative variable, but can be turned into a categorical variable. It depends on the mechanism of measurement and the future use of the data. #4. In general, there is more information in quantitative data than in categorical data. For example, percentage grade (quantitative, with % as the units) vs. letter grade (ordinal) vs. pass/fail (binary).
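Comment #4 can be illustrated with a short Python sketch that pushes the same percentage grades down to letter grades and then to pass/fail. The letter-grade cutoffs and the sample grades are hypothetical, chosen only for this example; they are not the course's grading scheme.

```python
def letter_grade(percent: float) -> str:
    """Convert a percentage grade (quantitative) to a letter grade (ordinal).

    The cutoffs below are hypothetical, chosen only for illustration.
    """
    cutoffs = [(80, "A"), (70, "B"), (60, "C"), (50, "D")]
    for minimum, letter in cutoffs:
        if percent >= minimum:
            return letter
    return "F"


def pass_fail(letter: str) -> str:
    """Convert a letter grade (ordinal) to pass/fail (binary)."""
    return "Fail" if letter == "F" else "Pass"


# Hypothetical percentage grades for five students.
percent_grades = [91.5, 78.0, 64.0, 52.5, 38.0]
letters = [letter_grade(p) for p in percent_grades]
outcomes = [pass_fail(g) for g in letters]

print(letters)   # ['A', 'B', 'C', 'D', 'F']
print(outcomes)  # ['Pass', 'Pass', 'Pass', 'Pass', 'Fail']
```

Each step can be computed from the one before it, but not the other way around: a "Pass" cannot be turned back into a letter, and a letter cannot be turned back into a percentage. That one-way loss is what "more information in quantitative data" means here.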

Identifier and String Variables

Other types of information are often found in spreadsheets; some are known as identifier variables, while others are called string variables.

Examples of Identifier Variables: Student ID Number, Social Insurance Number, UPS Tracking Number
Examples of String Variables: Dates, times (on a 12-hour or 24-hour clock), text or non-numeric symbols.

Identifier variables and string variables are neither categorical nor quantitative (even though they look like numbers, there are no units). Date strings can be transformed into data; for example, subtract a birthdate from the current date to get age. But without transformation, dates are not categorical and are not quantitative.
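The birthdate-to-age transformation mentioned above might look like the following Python sketch. The function name `age_in_years` and the sample dates are assumptions made for this illustration; the reference date is passed in explicitly so the result does not depend on when the code is run.

```python
from datetime import date


def age_in_years(birthdate: date, today: date) -> int:
    """Turn a date (not data by itself) into age in whole years (quantitative)."""
    # Subtract one if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    return today.year - birthdate.year - (0 if had_birthday else 1)


# Hypothetical birthdate and reference date.
print(age_in_years(date(2000, 9, 15), date(2021, 1, 11)))  # 20
```

The resulting age has units (years), so it can be treated as a quantitative variable, or grouped into age bands to make it categorical.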

Cross-sectional vs. Time Series Data

Cross-sectional: data are collected at one point in time (e.g., surveys)
Time Series: data are collected longitudinally at various time points (e.g., sales records)

Once again, sometimes the type of data is clear and obvious, sometimes it is not. It can depend on context and on the ultimate use of the data; that is, on how you will analyze the data.


Example: Employees at ABC Company must complete an employee questionnaire which is kept on file by the Human Resources Department. Following is a sample of the questions. For each, decide whether it is more likely to be categorical or quantitative, whether it could be either, or whether it could be neither!
• Date of birth
• Highest level of education
• Number of jobs in past 10 years
• Type of residence
• Number of children
• Before-taxes income in the last year before joining ABC Company
• Alcohol consumption
• Absenteeism (# days of work missed in a year)

Answers:
• Date of birth: this is neither categorical nor quantitati...

