Association is not causation

Association is not causation is perhaps the most important lesson one learns in a statistics class. Correlation is not causation is another way to say this. Throughout the Statistics part of the book, we have described tools useful for quantifying associations between variables. However, we must be careful not to over-interpret these associations. There are many reasons that a variable X can be correlated with a variable Y without having any direct effect on Y. Here we examine four common ways that can lead to misinterpreting data.

19.1 Spurious correlation

The following comical example underscores that correlation is not causation. It shows a very strong correlation between divorce rates and margarine consumption.

Does this mean that margarine causes divorces? Or do divorces cause people to eat more margarine? Of course the answer to both these questions is no. This is just an example of what we call a spurious correlation. You can see many more absurd examples on the Spurious Correlations website.

The cases presented in the spurious correlation site are all instances of what is generally called data dredging, data fishing, or data snooping. It's basically a form of what in the US they call cherry picking. An example of data dredging would be if you look through many results produced by a random process and pick the one that shows a relationship that supports a theory you want to defend.

A Monte Carlo simulation can be used to show how data dredging can result in finding high correlations among uncorrelated variables. We will save the results of our simulation into a tibble:

N <- 25
g <- 1000000
sim_data <- tibble(group = rep(1:g, each = N),
                   x = rnorm(N * g),
                   y = rnorm(N * g))

We simulate one million groups, each with 25 observations of x and y. Because we constructed the simulation, we know that x and y are not correlated. Next, we compute the correlation between x and y for each group and sort from highest to lowest:

res <- sim_data %>%
  group_by(group) %>%
  summarize(r = cor(x, y)) %>%
  arrange(desc(r))
res
#> # A tibble: 1,000,000 x 2
#>     group     r
#>     <int> <dbl>
#> 1  531112 0.838
#> 2  464515 0.792
#> 3  983513 0.778
#> 4  784198 0.776
#> 5   24505 0.774
#> # … with 999,995 more rows

We see a maximum correlation of 0.838, and if you just plot the data from the group achieving this correlation, it shows a convincing plot that x and y are in fact correlated:

sim_data %>%
  filter(group == res$group[which.max(res$r)]) %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm")
#> `geom_smooth()` using formula 'y ~ x'

Remember that the correlation summary is a random variable. Here is the distribution generated by the Monte Carlo simulation:

res %>% ggplot(aes(x = r)) +
  geom_histogram(binwidth = 0.1, color = "black")
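Because each group's correlation is computed from only 25 pairs of observations, this distribution is wide. As a quick sanity check (a sketch, assuming the res tibble computed above), its standard deviation should be close to the theoretical value 1/sqrt(N - 1) = 1/sqrt(24), which is about 0.204:

# Standard deviation of the one million simulated null correlations;
# for N = 25 independent pairs this is approximately 1/sqrt(24)
res %>% summarize(se = sd(r))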

It's just a mathematical fact that if we observe random correlations that are expected to be 0, but have a standard error of 0.204, the largest one will be close to 1. If we performed regression on this group and interpreted the p-value, we would incorrectly claim this was a statistically significant relation:

library(broom)
sim_data %>%
  filter(group == res$group[which.max(res$r)]) %>%
  summarize(tidy(lm(y ~ x))) %>%
  filter(term == "x")
#> # A tibble: 1 x 5
#>   term  estimate std.error statistic     p.value
#>   <chr>    <dbl>     <dbl>     <dbl>       <dbl>
#> 1 x         1.01     0.137      7.38 0.000000167
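The scale of the search matters too: with a million groups, even a 5% false positive rate produces tens of thousands of apparently significant correlations. Here is a hedged sketch of that check, using the res tibble from above (this computation is not in the original text):

# Under the null, t = r * sqrt(N - 2) / sqrt(1 - r^2) follows a
# t-distribution with N - 2 degrees of freedom, so the fraction of
# groups reaching p < 0.05 by chance alone should be about 5%
N <- 25
t_stat <- res$r * sqrt(N - 2) / sqrt(1 - res$r^2)
mean(2 * pt(-abs(t_stat), df = N - 2) < 0.05)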



This particular form of data dredging is referred to as p-hacking. P-hacking is a topic of much discussion because it is a problem in scientific publications. Because publishers tend to reward statistically significant results over negative results, there is an incentive to report significant results. In epidemiology and the social sciences, for example, researchers may look for associations between an adverse outcome and several exposures and report only the one exposure that resulted in a small p-value. Furthermore, they might try fitting several different models to account for confounding and pick the one that yields the smallest p-value. In experimental disciplines, an experiment might be repeated more than once, yet only the results of the one experiment with a small p-value reported. This does not necessarily happen due to unethical behavior, but rather as a result of statistical ignorance or wishful thinking. In advanced statistics courses, you can learn methods to adjust for these multiple comparisons.

19.2 Outliers

Suppose we take measurements from two independent outcomes, X and Y, and we standardize the measurements. However, imagine we make a mistake and forget to standardize entry 23. We can simulate such data using:

set.seed(1985)
x <- rnorm(100, 100, 1)
y <- rnorm(100, 84, 1)
x[-23] <- scale(x[-23])
y[-23] <- scale(y[-23])

The single unstandardized entry is a large outlier in both x and y, and the sample correlation is driven almost entirely by that one point. One alternative that is robust to outliers is the Spearman correlation: compute the correlation on the ranks of the values rather than the values themselves. With the outlier's influence removed, the correlation drops to essentially 0:

cor(rank(x), rank(y))
#> [1] 0.00251

There are also methods for robust fitting of linear models which you can learn about in, for instance, this book: Robust Statistics: Edition 2 by Peter J. Huber & Elvezio M. Ronchetti.

19.3 Reversing cause and effect

Another way association is confused with causation is when the cause and effect are reversed. An example of this is claiming that tutoring makes students perform worse because they test lower than peers who are not tutored. In this case, the tutoring is not causing the low test scores; it is the other way around. A form of this claim actually made it into an op-ed in the New York Times titled Parental Involvement Is Overrated. Consider this quote from the article:

When we examined whether regular help with homework had a positive impact on children's academic performance, we were quite startled by what we found. Regardless of a family's social class, racial or ethnic background, or a child's grade level, consistent homework help almost never improved test scores or grades… Even more surprising to us was that when parents regularly helped with homework, kids usually performed worse.

A very likely possibility is that the children needing regular parental help receive this help because they don't perform well in school. We can easily construct an example of cause and effect reversal using the father and son height data. If we fit the model:

$$X_i = \beta_0 + \beta_1 y_i + \varepsilon_i, \quad i = 1, \dots, N$$

to the father and son height data, with $X_i$ the father height and $y_i$ the son height, we do get a statistically significant result:

library(HistData)
data("GaltonFamilies")
GaltonFamilies %>%
  filter(childNum == 1 & gender == "male") %>%
  select(father, childHeight) %>%
  rename(son = childHeight) %>%
  summarize(tidy(lm(father ~ son)))
#>          term estimate std.error statistic  p.value
#> 1 (Intercept)   33.965    4.5682      7.44 4.31e-12
#> 2         son    0.499    0.0648      7.70 9.47e-13

The model fits the data very well. If we look at the mathematical formulation of the model above, it could easily be incorrectly interpreted so as to suggest that the son being tall caused the father to be tall. But given what we know about genetics and biology, we know it's the other way around. The model is technically correct. The estimates and p-values were obtained correctly as well. What is wrong here is the interpretation.
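The symmetry of correlation makes the point concrete: regressing son on father yields an equally significant slope, yet the data by themselves cannot tell us which regression matches the causal direction. A sketch along the same lines as the code above (not part of the original text):

# The reverse regression is just as statistically significant;
# significance alone says nothing about the direction of causality
GaltonFamilies %>%
  filter(childNum == 1 & gender == "male") %>%
  select(father, childHeight) %>%
  rename(son = childHeight) %>%
  summarize(tidy(lm(son ~ father))) %>%
  filter(term == "father")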

