Week 2 spreadsheet functions pt 2 PDF

Title	Week 2 spreadsheet functions pt 2
Course	Fundamentals of Business Analytics
Institution	Macquarie University
Pages	9
File Size	468.7 KB
File Type	PDF
Total Downloads	2
Total Views	135

Preview

CLICK TO PREVIEW PDF

Summary

Week 2 lecture notes - spreadsheet functions continued...

Description

week 2 lec: spreadsheet functions cont.

Measures of Location Mean (Arithmetic Mean) overall measure of central tendency notated by x excel

mean function

Median value in middle when data is arranged in ascending order odd observations → middle value even observations → avg of two middle values excel

median function

Mode value occurring most frequently in data set multimodal data → data contains two modes bimodal data → data contains exactly two modes excel need to select no. cells of expected modes > use mode.mult > cntrl + shift + enter

Geometric Mean

finding the nth root of the product of n values used in analysing % growth rates in financial data excel

geomean function (on the column of growth)

eg. Percentage Annual Returns and Growth Factors for the Mutual Fund Data. Determine the mean rate of growth for 10 yr period cant look only at return % since its relative

1. product of growth factors

2. geometric mean of growth factors →

3. find annual avg return rate growth →

Measures of Variability

Range difference between smallest and largest value in data set drawback → based on only two observations values

highly influenced by extreme

Variance measure of variability that utilises all data based on deviation around mean (difference between mean and each observation) variance = deviation around the mean squared

Standard Deviation positive square root of the variance measured in same unit as original data

Coefficient of Variation descriptive stat that indicates how large the standard deviation is relative to the mean expressed as a percentage

Analysing Distributions Percentile value of a variable at which a specified (approximate) percentage of observations are below that value.

he pth percentile tells us the point in the data where: approx p % of the observations have values less than the pth percentile approx (100 − p)% of observations have values greater than pth percentile

Quartiles data is divided into four equal parts 3rd - 1st percentile = interquartile range IQR

Z-Score measures the relative location of a value in the data set

determines how far a value is from the mean relative to the data set standard deviation standardised value

in excel → STANDARDIZE (observation, $mean$, $stdev$) empirical rule when data has symmetric bell shaped distribution determine the % of observations within a specified number of st deviations from mean for bell-shaped distribution: 68% data values within 1 st dev 95% data values within 2 st dev almost all data values within 3 st dev

Outliers extreme values in a data set can be identified using standardized values (z-scores) value with a z-score less than –3 or greater than +3 can be reviewed to determine accuracy and whether they belong in the data set

Box Plots graphical summary of the distribution of data. developed from the quartiles for a data set

Measures of Association Between 2 Variables Scatterchart useful graph for analyzing the relationship between two variables shows positive / negative + linear/non linear relationship

Covariance descriptive measure of the linear association between two variables

covariance in excel → Covariance.S(column 1, column 2) correlation coefficient in excel → correl(column 1, column 2)

Correlation Coefficient measures the relationship between two variables not affected by the units of measurement for x and y

Interpretation of Correlation Coefficient: r is between -1 and +1 closer to -1 or +1

Data Cleansing Missing Data

datasets commonly have observations with missing values for some variables legit missing data → naturally occurring missing data. no need for fixes illegit missing data → missing data occur for different reasons. fixed with following options: discard observations (rows) with any missing values discard any variable (column) with missing values fill in missing entries with estimated value apply a data-mining algorithm that can handle missing values missing completely at random (MCAR) observation to be missing the value for some variable is entirely random does not depend on either the value of the missing data or the value of any other variable in the data missing at random (MAR) missing a value for some variable is related to the value of some other variable(s) in the data imputation → systematic replacement of missing values with values that seem reasonable

Identification of Erroneous Outliers + Values examine data to uncover data quality issues using: summary statistics frequency distribution bar charts and histograms z score scatterplot correlation coefficient possible causes for misleading summary stats many software ignore missing values when calculating various summary statistic

if missing values indicated w unique value (eg #NA) can be accidently used by software when calculating summary stats better to deal with missing data before using summary stats to identify erroneous outliers + values

Variable Representation number of variables may make data harder to analyse in many data mining applications dimension reduction → process of removing variables from the analysis without losing crucial information critical part of data mining is how to represent variable measurements + which variables to consider datasets often have non useful variables but can be combined with others into new variable that shows important relationship...