AESTHETIC MATH1005 study notes PDF

Title AESTHETIC MATH1005 study notes
Course MATH1005 Statistics
Institution University of Sydney
Pages 41
File Size 4.2 MB
File Type PDF

Summary

notes taken from the math1005 "statistical thinking with data" course. modules are colour-coded. includes tables and simple illustrations for easier understanding....


Description

Exploring Data - 1.1.1 design of experiments
CONTROLLED EXPERIMENTS

DOMAIN KNOWLEDGE … background context information that helps you to understand the data
● data scientists need interest & curiosity in whatever area is being investigated, plus good collaboration skills
● domain experts: specialised in one field (data scientists most often collaborate with specific domain experts)

TYPES OF EVIDENCE
problematic
● a personal testimony can only suggest a more generalised finding
● the source(s) behind a media article are often poorly cited
research
● every stage of a statistical study (design, data collection, statistical methods, conclusion) should be documented and checked in the review process
● journals require reproducible research → data sets & software made available for verifying published findings & conducting alternative analyses
● a meta-analysis collects, summarises and draws conclusions from multiple scientific studies on a specific research question

CONTROLLED EXPERIMENTS: DESIGN
● aim: study whether a ‘treatment’ causes a ‘response’ → the response can also be due to other factors (‘variables’)
● conduct 2 parallel experiments which differ only in whether the treatment is administered or not
1. Treatment group
2. Control group

CONFOUNDING & BIAS
● confounding: occurs when the effect of one variable (X) on another variable (Y) is clouded by the influence of a third variable (Z)
● bias: the quantity of interest is systematically over- or underestimated
○ often caused by a confounding variable (or by other causes)

bias type → solution

SELECTION bias → randomised controlled trial (RCT)
● if the Treatment Group is not comparable to the Control Group, the differences between the 2 groups can confound the effect of the treatment

OBSERVER bias → randomised controlled double-blind trial (often not possible)
● if subjects or investigators are aware of the identity of the 2 groups → bias in either responses or evaluations, as they may subconsciously or deliberately report more/less favourable results
● placebo: pretend treatment designed to be neutral and indistinguishable from the treatment
● placebo effect: a response that occurs because the subject thinks they have had the treatment
● both subjects (‘single blind’) and investigators (‘double blind’) are unaware of the identity of the 2 groups
○ allows control of the patient’s expectations (ie. their responses) and the investigator's observations (evaluation of the response)
○ usually a 3rd party administers the treatment & placebo
○ the placebo is designed to mimic the treatment as closely as possible

CONSENT bias → no simple solution
● certain subgroups of participants choose not to participate in a trial*
● raises many ethical questions
○ how can we avoid consent bias?
○ who determines who is a part of each group?
○ it may be unethical to withhold treatment from those in the control group or to enforce treatment on those in the treatment group eg. covid vaccine trial, peanut allergy trial
*consenting patients often have mild dispositions
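The randomisation behind an RCT can be sketched in a few lines. This is only an illustrative Python sketch (the course itself works in R); the function name `randomise`, the subject IDs and the seed are all hypothetical.

```python
import random

def randomise(subjects, seed=None):
    """Randomly split subjects into a treatment group and a control group."""
    rng = random.Random(seed)    # seeded so the allocation can be reproduced
    shuffled = list(subjects)    # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subject IDs, for illustration only
subjects = [f"subject_{i}" for i in range(1, 21)]
treatment, control = randomise(subjects, seed=1005)
print(len(treatment), len(control))
```

Because chance alone decides the allocation, neither the investigator nor the subject can steer who ends up in which group, which is what makes the two groups comparable on average.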

Exploring Data - 1.1.2 design of experiments: observational studies

OBSERVATIONAL STUDIES (OS)
1. assignment of subjects into treatment & control groups is outside the control of the investigator → no randomisation for allocation, no placebo
eg. smoking → investigators cannot choose which subjects will be in the Treatment group (ethics)
2. many research questions require an observational study rather than a controlled experiment
3. conclusions require great care

PRECAUTIONS (I-III)
1. difficult to establish causation
● association is easy to establish: it can suggest causation but does not prove it
● OS can have misleading hidden confounders
○ hard to find
○ can mislead about a cause-and-effect relationship eg. “smoking is confounded with the effect of alcohol consumption”
○ strategy: controlling for confounders
if a confounder is known, we can potentially add it as an additional variable
eg. surveying smoking and alcohol consumption → add a variable with 3 possible values, thereby “controlling” for alcohol consumption
- heavy drinkers
- medium drinkers
- light drinkers
→ if, within a group, the cancer rate is higher for smokers than for non-smokers, alcohol consumption cannot explain the difference
■ limitations:
● no clear distinction between groups
● more confounders = greater sample size needed
● limited by the ability to identify all confounders
● it took a long time to establish that smoking causes lung cancer → researchers needed to control for factors like health, fitness, diet, lifestyle & environment

2. OS with a confounding variable can lead to Simpson’s (Reversing) Paradox
● sometimes there is a clear trend in individual groups of data that reverses when the groups are pooled together
○ occurs when: relationships between %s in subgroups are reversed when the subgroups are combined, because of a confounding or lurking variable
○ the association between a pair of variables (X,Y) reverses sign upon conditioning on a third variable Z, regardless of the value taken by Z
eg. drugs A & B given on day 1 or day 2: better drug = B
● confounding variable: “day”
○ day 1 is easier to cure than day 2
○ is the disease harder to cure when it has progressed further?
● drug A seems better on the pooled data since it was mostly given early (day 1) and drug B mostly late (day 2)
● definite answer: one needs to understand why day 1 is different from day 2; different success rates might be a hint that there is more to uncover
eg. smoking & mortality: quantity of interest = mortality rate, but young people are more likely to be smokers

3. historical control
● some studies present themselves as a controlled experiment, but on further examination there is a historical control & time is a confounding variable (partly observational & partly an experiment)
● eg. investigators may compare the effect of a new medication on current patients with that of an old medication on past patients
○ the Treatment Group (new drug) & the historical Control Group (old drug) may differ in aspects besides the treatment
● controlled experiments need to be performed in the same time period (contemporaneously)
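Simpson's paradox is easier to believe once the arithmetic is laid out. The Python sketch below uses entirely hypothetical cure counts, chosen so that drug B has the higher cure rate on each day while drug A has the higher rate once the days are pooled; none of the numbers come from the course.

```python
# (cured, treated) per drug and day; hypothetical counts for illustration only.
# Drug A is mostly given on day 1 (easier to cure), drug B mostly on day 2.
counts = {
    "A": {"day 1": (234, 270), "day 2": (55, 80)},
    "B": {"day 1": (81, 87), "day 2": (192, 263)},
}

def cure_rate(cured, treated):
    return cured / treated

# Cure rate within each day (conditioning on the confounder "day")
by_day = {
    drug: {day: cure_rate(*pair) for day, pair in days.items()}
    for drug, days in counts.items()
}

# Cure rate after pooling both days together
pooled = {
    drug: cure_rate(
        sum(cured for cured, _ in days.values()),
        sum(treated for _, treated in days.values()),
    )
    for drug, days in counts.items()
}

for drug in counts:
    print(drug, by_day[drug], "pooled:", round(pooled[drug], 3))
```

The reversal comes purely from the unequal allocation: most of A's patients sit in the easy group, so its pooled rate is pulled up.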

Exploring Data - 1.2.1 Data & Graphical Summaries: qualitative data

DATA
● data: information about the set of subjects being studied (like road fatalities)
○ most commonly, data refers to the sample, not the population
● types of data: come in different formats eg. survey data, spreadsheet-type data, MRI image data
● big data: the massive amounts of data being collected
○ commonly high dimensional (variables p > subjects n) eg. genomics data can have 3 billion variables since a person’s DNA sequence is 3 billion base pairs long; measurements every millisecond; image or video data
○ requires more complex visualisations

INITIAL DATA ANALYSIS (IDA) … a first general look at the data, without formally answering the research questions
● helps you to see whether the data can answer your research questions
● may pose other research questions
● can
○ identify the data’s main qualities
○ suggest the population from which a sample derives
1. data background: checking the quality & integrity of the data
2. data structure: what information has been collected?
3. data wrangling: scraping, cleaning, tidying, reshaping, splitting, combining
4. data summaries: graphical & numerical

STRUCTURE OF THE DATA
● variables: measure or describe some attribute of the subjects
○ data with p (explanatory) variables is said to have dimension p
● number of variables:
● types of variables: eg. age can be qualitative (age brackets) or quantitative → it all depends on how you want to analyse it
● R data types (before vs after analysing): int = integers, chr = character (R reinterprets as factor), levels = categories

GRAPHICAL SUMMARIES
● aim: best highlight the features of the data
○ can use trial & error to some extent
○ pie chart → popular, but usually not informative
● tabling data: a frequency table lists the different years & the no. of times each year occurred

graph → sub-types
bar plot
● simple bar plot/chart/graph: simple summary of qualitative data
● double bar plot (considers 2 qualitative variables)
○ stacked bar plot
○ side-by-side bar plot
● -9 = missing data
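Tabling qualitative data just means counting how often each category occurs; those counts become the heights of a simple bar plot. A minimal Python sketch, assuming a hypothetical list of years (the course does this in R):

```python
from collections import Counter

# Hypothetical "year" values for a handful of records
years = [2019, 2020, 2019, 2021, 2020, 2019]

table = Counter(years)    # maps each distinct year to the number of times it occurs
for year in sorted(table):
    print(year, table[year])
```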

Exploring Data - 1.2.2 Data & Graphical Summaries: quantitative data

HISTOGRAM
● primitive cleaning of data
○ do not directly remove values from the data file → ensures reproducibility eg. replace an age of “-9” by “NA”
● crowding: a lot of data in a small range eg. lots of road fatalities in the [18,25) age group
● purpose: quantitative data
● what does it show: highlights the % of data in one class interval compared to another
○ set of blocks which represent the %s by area
○ area of the whole histogram is 100%
○ horizontal scale is divided into class intervals → not necessarily of the same length
○ area of each block represents the % of subjects in that particular class interval
● mostly use the density scale (advantages for later modelling) → makes the area under all blocks = 1 (ie. 100%)
○ useful for probability
● continuous data: need an endpoint convention for data points that fall on the border of 2 class intervals
○ left-closed & right-open: [0,18) [18,25)
● number of class intervals: consider how many class intervals you want to have
● draw the horizontal axis & blocks
● controlling for a variable
● common mistakes:
○ block heights set equal to the %s or total numbers
■ unless the class intervals are the same size, this makes larger class intervals look like a larger overall %
■ solution: use density as the height, especially if the class intervals are not the same size; don’t use %s or total numbers
○ using too many/few class intervals
■ can hide the true pattern in the data
■ solution: 10-15 class intervals
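The density scale can be worked out by hand: the height of each block is its share of the data divided by the width of its class interval. The Python sketch below assumes hypothetical ages and class intervals and uses the left-closed, right-open convention; `density_histogram` is an illustrative helper, not a course function.

```python
def density_histogram(data, breaks):
    """Return (left, right, proportion, density) for each class interval [left, right)."""
    n = len(data)
    rows = []
    for left, right in zip(breaks, breaks[1:]):
        count = sum(1 for x in data if left <= x < right)  # left-closed, right-open
        proportion = count / n                  # area of the block
        density = proportion / (right - left)   # height = area / width
        rows.append((left, right, proportion, density))
    return rows

# Hypothetical ages and unequal class intervals (every age falls inside the breaks)
ages = [17, 18, 19, 22, 24, 30, 45, 52, 67, 80]
breaks = [0, 18, 25, 70, 100]

rows = density_histogram(ages, breaks)
for left, right, proportion, density in rows:
    print(f"[{left},{right}) proportion={proportion:.2f} density={density:.4f}")

# Sum of block areas (height x width)
total_area = sum(density * (right - left) for left, right, _, density in rows)
print(total_area)
```

Here [18,25) and [25,70) hold the same share of the data, but the narrower interval gets the much taller block, which is exactly the crowding that raw counts or %s as heights would hide.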

RESEARCH QUESTION
● statistical thinking:
○ how can we quantify the riskiness of each age group?
○ which variables in our data might be useful?
○ do we need additional data? what kind of data?
● strategy:
1. only count those deaths where the person is the driver → see the definition of Road.User
2. find data for registered driving licences with age information (for the total no. of drivers on the road) → South Australia provides this info on data.gov.au
○ pooled data → no. of driving licences in SA by age group (idea of distribution)
3. revisit (1) to find data for registered driving licences with age info
4. combine the info & derive a death rate per driving licence for different age groups
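The final step is a division per age group. A small Python sketch, assuming made-up driver deaths and licence counts (the age brackets and every number are hypothetical, not the South Australian data):

```python
# Hypothetical driver deaths and registered licences per age group
driver_deaths = {"16-24": 30, "25-59": 60, "60+": 40}
licences = {"16-24": 150_000, "25-59": 800_000, "60+": 300_000}

# Death rate per 100,000 licences, so age groups of different sizes are comparable
rate_per_100k = {
    group: 100_000 * driver_deaths[group] / licences[group]
    for group in driver_deaths
}

for group, rate in rate_per_100k.items():
    print(group, round(rate, 2))
```

In this made-up example the middle age group has the most deaths but the lowest rate, which is why the raw counts alone cannot answer the riskiness question.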

graph → sub-type

box plot
● simple: plots the median, the middle 50% of the data in a box, the maximum & minimum, and determines the outliers
● comparative: splits up the quantitative variable by a qualitative variable

scatter plot
● examines the relationship between 2 quantitative variables
● not so useful here → difficult to determine the relationship between age & year in fatalities

heatmap
● useful when a contingency table is not practical due to too many different values
● good for showing the no. of times a particular combo appears
● scaled by column (ensures the total number stays constant) & results in something like an Age-histogram for every year

Exploring Data - 1.3.1 numerical summaries: centre

NUMERICAL SUMMARIES
● a numerical summary reduces all the data to one simple number (“statistic”)
○ loses a lot of information
○ however, allows easy comparisons & is easily understandable
● major features that we can summarise numerically are:
○ maximum
○ minimum
○ centre → sample mean, median
○ spread → standard deviation, range, IQR
● choice of summary depends on the research question
● useful notation for data: observations \( x_1, x_2, \dots, x_n \); ordered observations \( x_{(1)} \le x_{(2)} \le \dots \le x_{(n)} \)

SAMPLE MEAN & MEDIAN

sample mean: average of the data
● formula: \( \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \)
● what is it: the unique point at which the data is balanced
○ involves ALL of the data → outliers pull the mean value
○ the higher readings & the lower readings all cancel each other out

sample median: the middle data point when the observations are ordered from smallest to largest → used when data is SKEWED
● formula, odd number of observations: \( \tilde{x} = x_{((n+1)/2)} \)
● formula, even number of observations: \( \tilde{x} = \frac{1}{2}\left( x_{(n/2)} + x_{(n/2+1)} \right) \)
● what is it: involves only SOME of the data
● on the histogram: 50% of the houses sold are below and 50% are above $1.4 million
● on the boxplot: the median is the centre line (data for 4 bedrooms is right-skewed)

MEAN VS MEDIAN
the difference can be an indication of the shape of the data
● symmetric data: sample mean = sample median
● left skewed data: sample mean < sample median
● right skewed data: sample mean > sample median
● which is optimal for describing centre?
○ median (robust): preferable for data which is skewed or has many outliers eg. Sydney house prices
○ mean: preferable for symmetric data with not too many outliers & for theoretical analysis
● limitations: both need to be paired with a measure of spread

ROBUSTNESS & COMPARISONS
● robustness: the sample median is said to be robust and is a good summary for skewed data, as it is not affected by outliers
eg. suppose there was a data entry mistake in the property prices, and the lowest property recorded as 370 was in fact the highest sold at 3700.
- How would the sample mean change? The sample mean would be higher, as we have replaced the smallest reading by what is now the maximum
- How would the sample median change? The sample median would shift up, from the average of x(28) and x(29) to the average of x(29) and x(30)
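The effect of the data entry mistake can be checked on a small data set. The Python sketch below uses seven hypothetical prices (in thousands), not the course's property data, and moves the smallest reading, 370, to 3700:

```python
import statistics

# Hypothetical property prices in thousands; 370 is the mis-recorded value
recorded = [370, 800, 950, 1100, 1400, 1500, 2100]
corrected = [3700 if price == 370 else price for price in recorded]

mean_before = statistics.mean(recorded)
mean_after = statistics.mean(corrected)
median_before = statistics.median(recorded)
median_after = statistics.median(corrected)

print("mean:  ", round(mean_before, 1), "->", round(mean_after, 1))
print("median:", median_before, "->", median_after)
```

The mean moves by the full size of the correction divided by n, while the median only steps up to the next ordered observation.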

Exploring Data - 1.3.2 numerical summaries: spread

● summarising the gaps (data point − mean) into 1 number (“spread”)
● mean of the gaps = 0 (gaps cancel each other out) → use the root mean square (RMS)
● RMS: measures the average size of a set of numbers, regardless of their signs
○ square the numbers
○ mean the result
○ root the result

STANDARD DEVIATION
● variance: squared standard deviation (SD^2)
● the difference between SDpop & SDsample becomes negligible with large sample sizes

population standard deviation SDpop
● what is it: RMS of the gaps from the sample mean
● formula: \( SD_{pop} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2} \)
● in R: more effort → shortcut: popsd(data), must first load the package using library(multicon)
● or turn sd(data) into SDpop: multiply by sqrt((sample size − 1) / (sample size))

sample standard deviation SDsample
● what is it: adjusted RMS of the gaps from the sample mean
● formula: \( SD_{sample} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2} \)
● in R: shortcut: sd(data) → the sd command in R always works out the sample version, as we most commonly have samples
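The link between the two standard deviations can be verified numerically. This Python sketch uses a hypothetical data set and the standard library's `statistics.stdev` (sample version) and `statistics.pstdev` (population version) in place of the R commands named in the notes; `rms` is an illustrative helper.

```python
import math
import statistics

def rms(values):
    """Root mean square: square the numbers, mean the result, root the result."""
    return math.sqrt(sum(v * v for v in values) / len(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical data
n = len(data)
mean = statistics.mean(data)

gaps = [x - mean for x in data]    # gaps from the mean; they sum to 0
sd_pop = rms(gaps)                 # population SD = RMS of the gaps
sd_sample = statistics.stdev(data) # sample SD (divides by n - 1)

# Turning the sample SD into the population SD
sd_pop_from_sample = sd_sample * math.sqrt((n - 1) / n)

print(sum(gaps), sd_pop, round(sd_sample, 4), round(sd_pop_from_sample, 4))
```

The adjustment factor is the square root of (n − 1)/n because the two formulas differ by that ratio inside the square root.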

STANDARD UNITS / Z SCORE … standard units of a data point = how many standard deviations it is below or above the mean
● measures how far away a particular point is from the mean (can be read off the histogram)
● formula: standard units = (data point − mean) / SD
● comparing 2 data points: compare their standard units
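Standard units make readings from different scales comparable. A short Python sketch, assuming two hypothetical exam results with made-up means and SDs:

```python
def standard_units(value, mean, sd):
    """How many SDs the value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Hypothetical: the same student sits two exams marked on different scales
z_exam_a = standard_units(75, mean=60, sd=10)
z_exam_b = standard_units(80, mean=70, sd=20)

print(z_exam_a, z_exam_b)
```

The raw mark is higher on exam B, but in standard units the exam A result sits further above its mean, so it is the stronger performance relative to the cohort.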

INTERQUARTILE RANGE (IQR) … range of the middle 50% of the data
● another measure of spread
● more formally, IQR = Q3 − Q1 where
○ Q1 is the 25th percentile (1st quartile) & Q3 is the 75th percentile (3rd quartile)
○ the median is the 50th percentile, or 2nd quartile: x˜ = Q2
● quantile: the set of q-quantiles divides the data into q equal-sized sets (in terms of percentage of data)
● percentile: 100-quantile
● quartiles: divide the data into four quarters
● boxplot:
○ IQR is the length of the box in the boxplot → represents the span of the middle 50% of the houses sold
○ lower & upper thresholds are a distance of 1.5 × IQR from the quartiles (by convention): Q1 − 1.5 × IQR and Q3 + 1.5 × IQR
● outlier / “extreme reading”: data outside these thresholds
● reporting:
○ like the median, the IQR is robust, so it’s suitable as a summary of spread for skewed data
○ report in pairs: (mean, SD) or (median, IQR)
● coefficient of variation (CV): combines the mean & standard deviation into one summary, CV = SD / mean
○ measure of spread relative to how large the values are
○ standardises the SD by the mean
○ used in: not much in stats
■ analytical chemistry to express the precision and repeatability of an assay
■ engineering and physics for quality assurance studies
■ economics for determining the volatility of a security
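The quartiles, the boxplot thresholds and the CV can all be computed from one small data set. The Python sketch below uses hypothetical data and `statistics.quantiles` with `method="inclusive"`; quartile conventions differ between software packages, so other tools may return slightly different Q1 and Q3 on the same data.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]    # hypothetical data with one extreme reading

# Quartiles Q1, Q2 (median), Q3; "inclusive" treats the data as the whole population
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Boxplot thresholds: 1.5 x IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]

# Coefficient of variation: SD standardised by the mean
cv = statistics.stdev(data) / statistics.mean(data)

print(q1, q2, q3, iqr, lower, upper, outliers, round(cv, 3))
```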

Modelling Data - 2.4.1 normal curve

NORMAL MODEL = UNIVARIATE DATA
LINEAR MODEL = BIVARIATE DATA (2 numerical variables)

NORMAL CURVE
● origins: Abraham de Moivre 1720
● importance:
○ approximates many natural phenomena
○ models data caused by combining a large number of independent observations
● symmetrical & bell-shaped

AREA UNDER STANDARD VS GENERAL NORMAL CURVE
standard (Z)
● mean: 0
● SD: 1
● denotation: N(0,1)
● area:
○ 68% → area within 1 SD of the mean in both directions
○ 95% → 2 SDs
○ 99.7% → 3 SDs
general (X)
● mean: any
● SD: any
● denotation: N(mean, SD^2)
● area:
○ any General Normal can be rescaled into the Standard Normal
○ z-score: standard unit, z = (x − mean) / SD

ways to find the area
1. integration
2. normal tables (using the fact that the curve is symmetrical)
3. use R
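The 68% / 95% / 99.7% areas and the rescaling of a General Normal can be checked numerically. The notes use R for this; the sketch below is a Python equivalent using `statistics.NormalDist`, and the mean of 100 and SD of 15 for X are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)    # standard normal N(0,1)

def area_within(k):
    """Area under the standard normal curve within k SDs of the mean."""
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))

# A general normal X with a hypothetical mean and SD
X = NormalDist(mu=100, sigma=15)
x = 130
z = (x - X.mean) / X.stdev       # rescale to standard units

# The area below x under X equals the area below z under Z
print(round(X.cdf(x), 4), round(Z.cdf(z), 4))
```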

Modelling Data - 2.4.2 reproducible reports (NOT ASSESSABLE)

REPRODUCIBLE RESEARCH
● journals are requiring reproducible research, which requires “data sets and software to be made available for verifying published findings & conducting alternative analyses”
○ Begley and Ellis study (2012): found that 47 out of 53 medical research papers focused on cancer research were irreproducible
○ follow-up study by Begley (2013): identified “6 flags for suspect work”
■ experimental design
● studies were not performed by investigators blinded to the experimental versus the control arms
● failure to repeat experiments
● lack of positive and negative controls
■ stats
● failure to show all the data
● inappropriate use of statistical tests
■ use of reagents that were not appropriately validated

DANGERS OF NON-REPRODUCIBLE RESEARCH
● without reproducible research:
○ data versions can change eg. people edit an Excel file without documenting what has changed and why
○ graphical summaries can change eg. people can photoshop images without keeping a record of what changed and why
● reproducible research is about being responsible with possible human errors or, worse, detecting intentionally changed results

R SCRIPT (.R)
● runs the exact same code
● all change is documented

R MARKDOWN (.Rmd) → used in this course
● R Markdown is an authoring framework for data science which produces dynamic, interactive documents with R
● saves and executes code AND produces a high quality report for the collaborator to view and validate the results

INPUTS & OUTPUT FORMATS
● an R Markdown file combines
○ chunks of text (written in markdown)
○ embedded code (eg. R, Python, SQL)
○ YAML metadata (to customise output)
● R Markdown supports dozens of static and dynamic output formats including HTML, PDF, MS Word, Beamer, HTML5 slides, Tufte-style handouts, books, dashboards, Shiny apps, scientific articles and websites

STEPS FOR .RMD → cheat sheet

HANDY TRICKS

Modelling Data - 2.5.1 scatter plots & correlation

BIVARIATE DATA … involves a pair of variables; we are interested in the relationship between the two variables
● X = independent variable (or explanatory variable, predictor or regressor)
● Y = dependent variable (or response variable)

SCATTER PLOT … a graphical summary of two quantitative variables on the same 2D plane
● creates a “cloud of points”
● shows the association between the 2 variables

5 numerical summaries:

