Week 4: Descriptive Statistics

Title: Week 4: Descriptive Statistics
Course: Statistics for Biologists
Institution: Queen's University

Summary

BIOL 243 Week 4: Descriptive Statistics (W20)...


Description

1  of 9  Module 4: Descriptive Statistics 4.2 - Descriptive Statistics - descriptive statistics are how we tell others about the characteristics of our sample - imagine you are in the beautiful La Boqueria market in Barcelona, which by North American standards has a spectacular variety of fresh food…how do you tell someone about the huge variety of seafood that’s for sale? or about the unbelievable size of the scallops? descriptive statistics give us a common language for summarizing this kind of data - the two most important pieces of information in descriptive statistics are central tendency and dispersion - each of these characteristics is calculated differently, depending on whether the variable in question is numerical or categorical - central tendency describes the typical value in your sample (e.g. mean) and dispersion describes the spread of the values Approach for Numerical Variables - these are two approaches to determining descriptive statistics for numerical variables - the first is based on calculating means and the second is based on calculating quartiles - both approaches will be covered after we work through categorical variables Categorical Data - categorical data is characterized using counts and proportions - counts are the number of sampling units in each category, and proportions are the share of the total sampling units in each category - when writing a report, you always want to include the count in a table, but for descriptive statistics it is often easier to understand the data as proportions - Example - the infographic from the Renewable Energy Policy Network for the 21st Century shows the number of jobs in renewable energy in 2016 - the categories are the different types of renewable energy and each job is the sampling unit - proportions for each category: solar 46%, bioenergy 36%, wind 14%, geothermal 2%, hydro 2% Central Tendency and Dispersion - counts and proportions indicate the central tendency of categorical data - on the other 
hand, range is used to indicate the dispersion, which describes the variation in the response variable - e.g. the proportion of renewable energy jobs in each category ranges from 2% to 46%, which is a relatively wide range Calculating the Mean - the first approach to calculating descriptive statistics for numerical variables is based on the mean value of your sample - the mean characterizes the central tendency of a numeric variable and is calculated by: - summing all of the values in your sample - dividing by the number of data points in your sample - you can easily calculate the mean using a calculator, or a spreadsheet program such as Microsoft Excel - sometimes, however, you will see the steps written as an equation Calculating the Variance - dispersion is characterized by the variance or standard deviation - you only need to calculate variance because standard deviation is the square root of variance


- we will use the convention that the variance is given the notation σ² and the standard deviation the notation σ

- the variance is the average squared distance of each data point from the sample mean - here is how you calculate it:
  - calculate the mean for a sample
  - calculate the difference between each data point and the mean, then square that value
  - sum the squares of the differences and divide by the number of observations/data points

Quartiles
- the second approach for calculating descriptive statistics for numerical variables uses quartiles
- quartiles are specific values of the variable that divide your data into ranked bins
- e.g. if we sort our data from smallest to largest and split it into four roughly equal groups, then each group contains about 25% of the data
- what’s cool is that we now know the range of values that are in the lower and upper parts of the dataset - this will allow us to calculate both central tendency and dispersion
- to begin, we need to calculate the quartiles - the steps are:
  1. sort the data in your sample from lowest to highest value
  2. find the second quartile by splitting the data in half according to whether
     - the sample has an odd number of observations, in which case the middle value of the dataset is the second quartile
     - the sample has an even number of observations, in which case the average of the two values closest to the middle is the second quartile
  3. find the first quartile by creating a subset of the data that is the lower-valued half of the observations, then use the rules in step 2 to find the middle value…the lower-valued subset is created according to whether
     - the sample has an odd number of observations, in which case the lower-valued subset is all values less than or equal to the second quartile…the subset includes the second quartile
     - the sample has an even number of observations, in which case the lower-valued subset is all values less than the second quartile…the subset does not include the second quartile
  4. find the third quartile by repeating step 3 but for the upper-valued half of the observations

Central Tendency and Dispersion
- once we have the quartiles, the rest is easy
- the central tendency, or middle value of the sample, is given by the second quartile - since it is the central quartile it is called the median - we know that 50% of the data lies below the median and 50% of the data lies above
- dispersion describes how much variation there is in a sample
- when working with quartiles, we can describe variation as the range of values that contain the centre-most 50% of the data - this range is between the 1st and 3rd quartiles and is called the interquartile range (IQR)
- to calculate the interquartile range, subtract the 1st quartile from the 3rd quartile

Quartiles or Means for Numerical Variables
- the choice to use quartiles versus means for descriptive statistics of numerical variables depends on what your data looks like
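The mean, variance, and median-of-halves quartile steps described above can be sketched in a few lines of Python. This is an illustration only; the wait-time values are hypothetical numbers in the spirit of the coffee example below, not data from the course.

```python
import statistics

def variance_population(xs):
    """Average squared distance of each data point from the sample mean."""
    m = sum(xs) / len(xs)                            # step 1: the mean
    return sum((x - m) ** 2 for x in xs) / len(xs)   # steps 2-3: square, sum, divide

def quartiles(xs):
    """Q1, median (Q2), Q3 using the median-of-halves steps above."""
    data = sorted(xs)                                # step 1: sort low to high
    n = len(data)
    q2 = statistics.median(data)                     # step 2: middle value (or average of two)
    if n % 2 == 1:                                   # odd n: each half includes the median
        lower, upper = data[: n // 2 + 1], data[n // 2 :]
    else:                                            # even n: each half excludes the median
        lower, upper = data[: n // 2], data[n // 2 :]
    return statistics.median(lower), q2, statistics.median(upper)

# hypothetical coffee wait times (minutes) with one extreme value
waits = [2.5, 2.75, 3.0, 3.25, 3.5, 3.75, 12.5]
q1, q2, q3 = quartiles(waits)
print(q1, q2, q3, q3 - q1)                           # 2.875 3.25 3.625 0.75
```

Note how the median (3.25) and IQR (0.75) barely register the 12.5-minute outlier, which previews the robustness point in the pros and cons that follow.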

3  of 9  Quartiles Pros - the median and interquartile range are relatively robust to extreme values - consider the two samples of wait times fro Starbucks coffee…in the first sample, the seven observations are similar to each other around a value of 3 minutes…in the second sample, there is one unusually long wait time of over 12 minutes…despite the unusually large wait time in the second sample, the median and interquartile range are similar for both Cons - the median and interquartile range become quite variable for samples with a small number of observations - imagine that you had just three observations in your sample - the median and interquartile range will be sensitive to the value of the middle observation Means Pros - the mean and standard deviation are more robust when there is a small number of observations in the sample - but perhaps the most common reason they are used is that the sample mean and standard deviation are needed for inferential statistics Cons - the downside to the mean and standard deviation is that they are sensitive to extreme values - consider the same two samples of coffee wait times used above, but now looking at the mean and standard deviation - there you can see how much of an impact the single extreme value has on both the mean and standard deviation compared to the quartiles - using the mean and standard deviation is commonplace - however, the robustness of quartiles to extreme values makes them a better choice for characterizing numerical variables, as long as the number of observations in a sample is not too small - no matter what method you choose, be sure that you are consistent between central tendency and dispersion - you do not want use means for one and quartiles for the other Key Takeaways and Definitions 1. Descriptive statistics characterize your sample. Common descriptive statistics include central tendency, which describes the typical value, and dispersion, which describes the variation. 2. 
Descriptive statistics for categorical variables are based on the number of observations in each category. They can be reported as raw counts or proportions to characterize central tendency, and range to characterize dispersion 3. Descriptive statistics for numerical variables are based either on the sample mean or the sample quartile. 4. The means approach for numerical variables uses the mean for central tendency and the standard deviation for dispersion. Standard deviation is the square root of variance, which measures the typical distance of the observations from the mean. 5. The quartiles approach for numerical variables uses the median for central tendency and the interquartile range for dispersion. The median is in the 2nd quartile that splits the dataset into two halves, and the interquartile range is the difference between the 3rd and 1st quartile.

4  of 9  6. The drawback to the means approach is that it is sensitive to extreme values in the sample, whereas the drawback to the quartiles approach is that it is sensitive when the sample size is small.

4.3 - Effect Size

Effect Sizes Tell Us Whether Explanatory Variables Have a Meaningful Impact
- do you remember this headline from The Guardian: “Yes, bacon really is killing us.”?…the article, published in 2018, was based on a series of research papers that looked at the impact of eating processed meat on cancer rates
- researchers often find statistical evidence that processed meat increases cancer risk - but the media largely ignores the magnitude of the effect, which is called effect size
- Example - consuming large amounts of red meat correlates with an effect size of a 20% increased risk of colorectal cancer - is 20% a large increase in risk? - it sounds like a large number - family history increases the risk of colorectal cancer by 30%-90% and drinking alcohol increases that risk by 5% - so eating red meat has a bigger effect on cancer risk than drinking, but both are dwarfed by family history
- effect size allows us to put study results into context

5  of 9  Effect Size - while statistical significance is important for indicating robust relationships, it does not tell us whether the explanatory variable has a meaningful effect - meaningfulness refers to whether or not the change you see among groups is important for your study - e.g. adding 5 minutes to your daily commute isn't likely a big deal, but adding 5 minutes to how long your computer takes to boot up is maddening - effect size allows us to evaluate whether the change in the response variables is meaningful for a particular study - we can calculate effect size as a part of descriptive statistics - as such, it is a description of the change in mean among the samples you collected from different groups - in observational studies, effect size is calculated as the change among groups - e.g. in a case-control study, effect size would be the change in mean value of the response variable between case and control groups - in experimental studies, effect size is calculated among treatment levels - e.g. 
in a single factor experiment, effect size would be the change in mean value of the response variable among the levels of the factor Calculating Effect Size - the effect size that we are calculating is called the absolute effect size, which is the simple change in mean value between groups - unfortunately, “effect size” also refers to a separate topic in inferential statistics - to avoid confusion, we are calculating the absolute effect size for descriptive statistics - yup, it’s cumbersome - but it helps separate the different topics - effect size can be calculated either as a difference or a ratio - Difference calculations are the differences in mean values among groups - Example - a sample of customers at Tim Hortons spent an average of $1.79 for a coffee whereas a sample of customers at Starbucks spent $2.10 for a coffee - the effect size based on difference is $2.10-$1.79=$0.31 - Starbucks coffee is $0.31 more expensive than Tim Hortons - using the difference to calculate effect size has the advantage of retaining the original scale - in the coffee example, effect size is still in units of dollars - Ratio calculations are the ratio of mean values among groups - in the above coffee example, the effect size based on ratios is $2.10/$1.79=1.17 - Starbucks coffee is 1.17 times the cost of Tim Hortons, or 17% more expensive - using the ratio to calculate effect size has the advantage of indicating a relative change, but it loses the original scale Choosing Difference Versus Ratio Calculations - the choice of calculating effect size using differences versus ratios depends on the study - differences give an effect size on the original scale of the response variable, which is particularly useful for studies where it is easy to assign meaning to the scale - money, weight, and speed are good examples of scales that have an inherent meaning - ratios, on the other hand, show a relative change and are useful in studies where the scale does not easily lend itself to meaning 
- medical studies are good examples where ratios are useful because the overall risk of getting a disease (e.g. cancer) is typically low and difficult to assign meaning


- by using ratios, researchers can instead evaluate the proportional increase in the risk of getting a disease

Key Takeaways and Definitions
1. Effect size compares the change in mean among groups in observational and experimental studies. It can be used to evaluate the meaningfulness of the explanatory variable.
2. Calculating effect size using differences gives a value on the scale of the response variable, whereas using ratios gives a relative change that does not include the scale. Both methods give what is called an absolute effect size.
3. The choice to use differences versus ratios depends a fair bit on whether the scale has an easy-to-interpret meaning. Differences are commonly used when the scale has a clear meaning (e.g. money) and ratios are commonly used when meaning is more difficult to evaluate (e.g. risk of dying from relatively rare diseases).
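The difference and ratio calculations from the coffee example can be reproduced in a couple of lines; only the sample means from the example are used.

```python
# Absolute effect size from the coffee example above: difference vs ratio
tims_mean = 1.79       # mean coffee price at Tim Hortons ($)
starbucks_mean = 2.10  # mean coffee price at Starbucks ($)

difference = starbucks_mean - tims_mean  # keeps the original scale (dollars)
ratio = starbucks_mean / tims_mean       # relative change; the scale is lost

print(f"difference: ${difference:.2f}")  # difference: $0.31
print(f"ratio: {ratio:.2f}")             # ratio: 1.17
```

The difference answers "how many more dollars?", while the ratio answers "how many times more expensive?", which is why the ratio form carries over naturally to low-probability scales like disease risk.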

5.1 - Contingency Tables

Contingency Table
- a contingency table shows the frequency (or proportion) of sampling units in each level of a categorical variable
- the frequency is simply the number of sampling units that falls in each level
- e.g. imagine that we surveyed a group of people to ask about common allergy symptoms and collected the following data:
  - 515 people have no symptoms
  - 205 people have wheezing
  - 96 people have asthma
  - 154 people have hay fever


- as a contingency table, the data looks like:

  no symptoms | wheezing | asthma | hay fever
  515         | 205      | 96     | 154

- the top row shows the name of each level in the categorical variable, and each cell shows the frequency, which is the number of sampling units with a response that falls in each level

Contingency Table as Proportions
- sometimes contingency tables are created as proportions rather than frequencies
- the preferred method is to use frequencies because it shows the raw data
- however, if proportions are used, the table should include the total number of sampling units as a note
- e.g. here is the same allergy symptom data as proportions (n = 970):

  no symptoms | wheezing | asthma | hay fever
  0.531       | 0.211    | 0.099  | 0.159
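Building a 1-way contingency table from raw responses amounts to counting and then dividing by the total. A minimal sketch using the allergy counts above (the raw response list is synthesized here for illustration):

```python
from collections import Counter

# synthesized raw responses reproducing the counts in the allergy example
responses = (["no symptoms"] * 515 + ["wheezing"] * 205
             + ["asthma"] * 96 + ["hay fever"] * 154)

freq = Counter(responses)                        # 1-way table as frequencies
total = sum(freq.values())                       # n = 970, reported as a note
props = {k: v / total for k, v in freq.items()}  # same table as proportions

print(freq["wheezing"], round(props["wheezing"], 3))  # 205 0.211
```

Reporting `total` alongside the proportions follows the convention above: proportions alone hide the raw sample size.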

1-Way and 2-Way Categorical Data
- this may seem like odd terminology, but 1-way contingency tables and 2-way contingency tables just refer to the number of categorical variables you observe for each sampling unit
- if you observe one categorical variable, then you have a 1-way contingency table - if you observe two categorical variables, then you have a 2-way contingency table
- while we do not consider more than two categorical variables in these lessons, contingency tables expand following the same pattern when there are three or more categorical variables

Data of Allergy Types and Home Settings
- 515 people have no symptoms (214 in urban, 301 in rural)
- 205 people have wheezing (104 in urban, 101 in rural)
- 96 people have asthma (55 in urban, 41 in rural)
- 154 people have hay fever (62 in urban, 92 in rural)
- 60 people have eczema (28 in urban, 32 in rural)


- as a 2-way contingency table, the data looks like:

          no symptoms | wheezing | asthma | hay fever | eczema
  urban   214         | 104      | 55     | 62        | 28
  rural   301         | 101      | 41     | 92        | 32

- the top row shows the name of each level in the first categorical variable, and the first column gives the name of each level in the second categorical variable
- each cell shows the frequency, which is the number of sampling units with a response that falls into the levels of both categorical variables
- e.g. there are 55 people in the allergy study that have asthma and live in an urban setting
- shown as proportions (n = 1030), the 2-way contingency table for this data is:

          no symptoms | wheezing | asthma | hay fever | eczema
  urban   0.208       | 0.101    | 0.053  | 0.060     | 0.027
  rural   0.292       | 0.098    | 0.040  | 0.089     | 0.031

Deciding What Categorical Variable Should Be the First Variable
- the short answer is that it has more to do with your narrative than anything about the variables - in essence, there is no technical reason to select one categorical variable to be the first variable over the other
- one thing to consider is that since the first variable is shown as column titles, it often is more prominent to readers
- thus, if you were writing an article that focussed mostly on types of allergies, then it is natural to choose the allergy variable as the first variable - on the other hand, if you were writing an article mostly about where people live, then it is more natural to choose the home setting variable as the first variable

Visualizing Patterns in Categorical Data
- contingency tables are an efficient way to summarize and visualize data, but they don’t do a good job of highlighting patterns when there are two (or more) categorical variables
- one way to see the overall patterns in the data is to calculate row and column frequencies, which are called marginal distributions
- the easiest way to learn how to calculate each type of distribution is through an example

Marginal Distributions
- when there are two (or more) categorical variables, it is common to present the total frequencies for each row and column
- these are the simple sums across each row and column, and are called marginal distributions
- the row marginal distribution shows the total counts for each row across all columns, and the column marginal distribution shows the total counts for each column across all rows
- to express marginal distributions as proportions, the frequencies are divided by the table total
- while not a distribution, it is also common to include the sum across all rows and columns - this is often called the “table total” and is the total number of sampling units in the sample

9  of 9  Quick Reference for Calculating Marginal Distributions - to calculate marginal distributions as frequencies - Row: sum frequencies across all columns for each row - Column: sum frequencies across all rows for each column - to calculate marginal distributions as proportions - Table total: sum all frequencies in the table - Row: sum frequencies across all columns for each row and divide by table total - Column: sum frequencies across all rows for each column and divide by table total Seeing the Patterns - the marginal distributions show us how many sampling units are in each level of one categorical variable without considering the other categorical variable - as such, marginal distributions are a good way to describe the overall patterns in the sample - Example - the marginal distributions show us how many sampling units fall into each level of allergy and home setting - the proportion of people who live in urban versus rural areas in the sample is fairly similar (urban:0.45, rural:0.55) indicating that the sample has relatively even representation of people from each setting - in contrast, the distribution of sampling units across allergy symptoms show that half the people do not suffer from allergies, and that wheezing (0.199) and hay fever (0.15) are more common than asthma (0.093) and eczema (0.058) Why Are They Called Distributions? - the term distributions refers to the categorical variables rather than the table - since the data in the table is used to calculate these row or column tallies, the term has stuck and is now commonly associated with contingency tables...
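The quick-reference steps can be sketched directly from the allergy-by-home-setting data. Frequencies come straight from the example; the nested-dict layout of the table is just one convenient representation.

```python
# 2-way contingency table from the allergy example: {symptom: {setting: count}}
table = {
    "no symptoms": {"urban": 214, "rural": 301},
    "wheezing":    {"urban": 104, "rural": 101},
    "asthma":      {"urban": 55,  "rural": 41},
    "hay fever":   {"urban": 62,  "rural": 92},
    "eczema":      {"urban": 28,  "rural": 32},
}

# table total: sum of every frequency in the table
total = sum(sum(row.values()) for row in table.values())

# row marginal distribution: sum each symptom across both settings
row_marginal = {sym: sum(row.values()) for sym, row in table.items()}

# column marginal distribution: sum each setting across all symptoms
col_marginal = {s: sum(row[s] for row in table.values())
                for s in ("urban", "rural")}

# as proportions, divide each marginal by the table total
print(round(col_marginal["urban"] / total, 2))     # 0.45
print(round(row_marginal["wheezing"] / total, 3))  # 0.199
```

These are exactly the urban/rural (0.45/0.55) and wheezing (0.199) proportions quoted in the "Seeing the Patterns" example above.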

