Univariate Descriptive Analysis

Izabela Anna Wowczko
Institute of Technology Blanchardstown

Abstract

Descriptive analysis is a method of describing the main features of data. It can be performed as a stand-alone investigation to understand a set of items, or serve as the initial element of a more extensive analysis. In machine learning, data is the source of hidden information that can be useful in knowledge discovery. In its raw form, however, it is merely a collection of random facts. To reveal its value, some assessment is required that evaluates the properties, integrity, and possible issues within the dataset. This paper presents descriptive statistics by the example of univariate analysis. We discuss the basic form of statistical analysis: summary statistics and visualisation techniques for exploring a single variable.

1. Introduction

Machine learning is a discipline heavily reliant on the quality of data. Successful knowledge discovery is therefore directly related to an understanding of the underlying records and their properties. Since the cognitive abilities of humans are arguably limited, there is a need for tools that make this evaluation more manageable. Statistics provides summary and visualisation techniques that capture the basic characteristics of, and trends among, the examples in a dataset. They offer valuable insight and allow identifying possible issues that can be resolved in further data processing. The remainder of this paper is structured as follows. Section 2 introduces statistical data exploration. Section 3 and Section 4 explain descriptive statistics for nominal and numerical variables. Section 5 summarises the topic by discussing the characteristics of good measures.

2. Statistical Data Exploration

Statistics are particularly useful in the data exploration stage of a business intelligence process. It is the first analysis, aimed at developing a high-level understanding of the collected items: their characteristics and the variety among the values of individual samples. In this process, the information contained in a dataset is condensed to key aspects that allow better judgement in further transformations.

For any data preprocessing to be successful with respect to a given objective, it is essential to have an overall picture of all observations and their attributes. This is achieved through exploratory data analysis, which highlights certain aspects of a dataset. To fully comprehend its complexity, it is paramount to look closely at each variable and understand how it reflects the properties and trends of the collected samples. This kind of analysis is known as univariate analysis (Cooper & Weeks 1983). While exploratory analysis is a broad term that involves many elements, univariate analysis is its fundamental component, describing the data with a set of calculated measures or in a graphical way.

Although univariate analysis is, in general, carried out by means of summary statistics and various visualisation techniques, the appropriate descriptors depend on the type of the attribute representing a certain property of the set. Based on the values they can take, we can distinguish numerical and nominal variables. Numerical variables are represented by numbers which have their usual meaning. Nominal variables, on the other hand, assign one value from a finite set of categories to each individual example. It is a common practice to use numerical coding for a nominal variable. These numbers, however, although they might indicate a rank ordering, do not carry the usual mathematical information (Han & Kamber 2006). Different methods are best suited to analysing different types of attributes. Section 3 and Section 4 present the most common ways of summarising and visualising nominal and numerical variables. They discuss non-graphical methods, which are quantitative and straightforward in their meaning, and complementing graphical methods, which are more qualitative and involve a degree of subjective analysis in their reading.
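To make the caveat about numerical coding concrete, here is a minimal pandas sketch (the column names and values are invented for illustration): the integer codes label the categories, but arithmetic on them describes nothing.

```python
import pandas as pd

# A toy dataset with one nominal and one numerical attribute
# (hypothetical names and values, for illustration only).
df = pd.DataFrame({
    "colour": ["red", "blue", "red", "green", "blue", "red"],
    "height_cm": [170, 165, 180, 175, 160, 172],
})

# Numerical coding of the nominal variable: the codes identify
# categories, but arithmetic on them is statistically meaningless.
codes = df["colour"].astype("category").cat.codes
print(codes.tolist())   # [2, 0, 2, 1, 0, 2]
print(codes.mean())     # a number, but it says nothing about colours
```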

3. Exploring Nominal Variables

Since nominal attributes are represented by categories, they are often easier to understand and interpret. Statistically, however, they contain less information than numerical values. Although some arithmetical measures such as frequency and central tendency are used to analyse the structure of nominal variables, they are less efficient in capturing the properties of qualitative data (Seltman 2003).

3.1 Frequency and Central Tendency

The basic measure in the summary statistics of a nominal variable is frequency. The frequency denotes the number of times a specific value of an attribute is observed in a dataset. Given a dataset of size N, the sum of all individual frequencies is always equal to N.
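As a minimal sketch (the data is invented), the frequency count and the sum-to-N property can be checked with Python's collections.Counter:

```python
from collections import Counter

# A hypothetical nominal variable with N = 8 observations.
colours = ["red", "blue", "red", "green", "blue", "red", "red", "green"]

freq = Counter(colours)
print(freq)                                # Counter({'red': 4, 'blue': 2, 'green': 2})
assert sum(freq.values()) == len(colours)  # individual frequencies sum to N
```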


Relative frequency, also referred to as a percentage, is a variation of the standard frequency. The relative frequency of a value refers to the fraction of observations taking that value among all observations in a dataset. Therefore, the relative frequencies always add up to 100 per cent. Relative frequency offers a way of identifying the mode, the only central tendency measure suitable for nominal data. Since the mode is the most frequent value, it is simply the value of the variable with the highest relative frequency. It is worth noting that in some cases a variable can have multiple modes; the meaning of this is explained further in the paper.
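Continuing the sketch above, the relative frequencies and the mode follow directly from the counts:

```python
from collections import Counter

colours = ["red", "blue", "red", "green", "blue", "red", "red", "green"]
freq = Counter(colours)
n = len(colours)

# Relative frequencies always add up to 1.0 (100 per cent).
rel_freq = {value: count / n for value, count in freq.items()}
print(rel_freq)   # {'red': 0.5, 'blue': 0.25, 'green': 0.25}

# The mode is the value with the highest relative frequency.
mode = max(rel_freq, key=rel_freq.get)
print(mode)       # 'red'
```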

3.2 Graphical Methods

The most common visualisation technique for a nominal variable is the bar chart. It presents all possible values (categories) of a variable, with the number of occurrences indicated by the height of the corresponding bar. Sample bar charts are illustrated in Figure 1.

Figure 1. Sample bar charts for nominal variable (source: Internet)
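A chart of this kind can be produced, for instance, with matplotlib; a minimal sketch using the invented colour data from above:

```python
import matplotlib.pyplot as plt
from collections import Counter

colours = ["red", "blue", "red", "green", "blue", "red", "red", "green"]
freq = Counter(colours)

# One bar per category; the bar height encodes the frequency.
plt.bar(list(freq.keys()), list(freq.values()))
plt.xlabel("Category")
plt.ylabel("Frequency")
plt.title("Bar chart of a nominal variable")
plt.show()
```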

A pie chart or a donut chart is a visualisation tool which illustrates the relative frequencies of values. Each sector represents the portion of occurrences of one value of a variable among all samples in a dataset. Altogether, the sectors form a full circle or ring (donut) synonymous with 100 per cent of the samples. Figure 2 presents sample pie and donut charts for a categorical variable.

Figure 2. Sample pie and donut charts for a categorical variable (source: Internet)
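In matplotlib, for example, the same counts can be drawn as a pie or a donut; shrinking the wedge width below 1 hollows the pie into a ring (a sketch, with counts taken from the invented data above):

```python
import matplotlib.pyplot as plt

labels = ["red", "blue", "green"]
counts = [4, 2, 2]

# autopct prints each sector's relative frequency; a wedge width
# below 1 turns the pie into a donut (ring) chart.
plt.pie(counts, labels=labels, autopct="%1.0f%%",
        wedgeprops={"width": 0.4})
plt.title("Donut chart of relative frequencies")
plt.show()
```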

The concepts of central tendency, spread, or skew, which are typical elements of summary statistics, have no meaning for categorical variables (Seltman 2003). Therefore, although easily interpretable, the graphical methods are also limited in the properties of a variable they can visualise. More formally, nominal variables are oftentimes presented in the form of frequency tables. Frequency tables arrange and summarise values in a way that reflects their distribution in a dataset (Figure 3). They can simply aggregate categories, but can also provide additional information such as relative frequency.

Figure 3. Sample frequency tables for a nominal variable (source: Internet)
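A frequency table of this kind can be assembled with pandas, for instance; value_counts with normalize=True yields the relative frequencies (same invented data as above):

```python
import pandas as pd

colours = pd.Series(["red", "blue", "red", "green", "blue", "red",
                     "red", "green"], name="colour")

# Absolute and relative frequencies side by side, as in a frequency table.
table = pd.DataFrame({
    "frequency": colours.value_counts(),
    "relative": colours.value_counts(normalize=True),
})
print(table)
```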


4. Exploring Numerical Variables

The nature of numerical variables allows more precise analysis with the use of statistical measures and their graphical interpretation. However, most of the available tools require that the data is sorted in order of magnitude. Additionally, for highly diversified discrete variables and all continuous variables, values need to be grouped (binned) into equally sized clusters that make the analysis more manageable (Shahbaba 2012). Among the typical descriptors of numerical attributes are the measures of central tendency, spread, modality and shape.
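As a small sketch of the binning step (the ages are invented), pandas.cut groups a numerical variable into equally wide intervals whose counts can then be summarised:

```python
import pandas as pd

ages = pd.Series([23, 35, 31, 47, 52, 29, 61, 44, 38, 26])

# Group the continuous variable into four equally wide bins; the binned
# variable is easier to summarise and plot than the raw values.
bins = pd.cut(ages, bins=4)
print(bins.value_counts().sort_index())
```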

4.1 Central Tendency

The central tendency of a dataset is the point around which most other points are gathered, typically the middle or most frequent value. The common measures of central tendency are statistics such as the mean, the median, and the previously mentioned mode.

The arithmetic mean (average) is the universal measure of central tendency. It adds up the values of a variable over all examples in a dataset and divides the sum by the number of samples. A variation of the average is the weighted arithmetic mean (weighted average), which associates a weight with each value of the variable. The weights reflect the significance or occurrence frequency of values within a given set. The trimmed mean is another variation of the standard measure, which excludes the extremely high and low points from the calculation of central tendency. Additionally, there are two other types of mean which are better suited to measuring relative values. The geometric mean uses the product of all values, as opposed to their sum (e.g. for calculating an average rate of return on investment). The harmonic mean deals with rates such as average speed or price. From the data mining standpoint, the modified averages are more useful than the plain arithmetic mean, as they allow incorporating some domain knowledge and limit the influence of outliers and noise on the measure of central tendency. Obviously, when incorrectly performed, trimming can result in the loss of valuable information, and weighting can excessively alter the distribution of values for a variable (Han & Kamber 2006).

The median is the measure that marks the middle value in a numerically ordered set. In the case of an even number of samples, the median is the average of the two middle values. For symmetric distributions, the mean and the median coincide (Figure 4). However, when the spread is asymmetric, the mean moves away from the median. For many slanted (left- or right-skewed) distributions, the median is the preferred measure of central tendency, as it is more robust to unusually high and low values. Conversely, the mean is heavily affected by any outliers (Seltman 2003).
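The measures discussed so far can be illustrated with Python's standard statistics module (all values, weights and growth factors below are invented):

```python
import statistics

values = [2, 4, 4, 5, 100]                 # hypothetical data with one outlier
weights = [3, 2, 2, 2, 1]                  # hypothetical significance weights
growth = [1.10, 0.95, 1.20, 1.05]          # hypothetical yearly growth factors
speeds = [60, 40]                          # speeds over two equal distances

print(statistics.mean(values))             # 23.0, pulled up by the outlier
print(statistics.median(values))           # 4, robust to the outlier
print(statistics.geometric_mean(growth))   # average growth factor
print(statistics.harmonic_mean(speeds))    # average speed: 48.0, not 50.0

# Weighted mean, and a trimmed mean that drops the lowest and highest value.
weighted = sum(w * x for w, x in zip(weights, values)) / sum(weights)
trimmed = statistics.mean(sorted(values)[1:-1])
print(weighted, trimmed)                   # 13.2 and 4.33...
```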

Business Intelligence and Data Mining, 2014

The mode, although usually applied to describing nominal variables, has an important role in exploring numerical attributes. It is commonly used in identifying the modality of a dataset and, contrary to the mean and median, can take more than one value. Multiple modes indicate that there are subgroups within the samples, and therefore that the dataset is not homogeneous. In some applications, the midrange can be calculated as a measure of central tendency (Han & Kamber 2006). It is the average of the largest and the smallest values of a variable, marking the midpoint of the full range of values.

Figure 4. Measures of central tendency and distribution (source: Internet)
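A dataset with several equally frequent values illustrates multiple modes; statistics.multimode returns all of them, and the midrange is a one-line calculation (invented values):

```python
import statistics

values = [2, 3, 3, 5, 7, 7, 9]

# A variable may have more than one mode; multimode returns all of them.
print(statistics.multimode(values))     # [3, 7] -> bimodal, hints at subgroups

# Midrange: the midpoint of the full range of values.
print((min(values) + max(values)) / 2)  # 5.5
```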

4.2 Measures of Spread

The spread of a distribution (variability) refers to the variation among the values of a variable. Whereas averages give a picture of the centre, the various measures of dispersion illustrate how far the samples fall from the centre and how scattered they are. There are several indicators of variability, including the range, the standard deviation and the five-number summary.

One measure of dispersion is the range: the difference between the minimum and maximum values. Although it is straightforward and easy to calculate, it does not reflect the distribution of values, but rather the smallest interval which contains all samples of a dataset (Shahbaba 2012).

A common summary statistic that measures the dispersion more precisely is the standard deviation. It is based on the deviations of all values in a dataset from the central tendency. Since the sum of all deviations is always zero, a supporting measure called the variance is required. The variance is calculated from the squared deviations, which make negative values positive and therefore allow an average to be estimated from their sum. The square root of the variance is the standard deviation, which tells how much deviation there is, on average, from the mean of a variable. Although the standard deviation is more accurate than the range, as it takes all items into consideration, its calculation is based on the mean and is therefore affected by extreme values (Han & Kamber 2006).

For values sorted numerically, it is possible to distinguish a number of subsets within the range. A popular way of analysing the spread of a distribution is by dividing the data into even fourths: quartiles. The median splits the samples into equally sized halves; it is therefore synonymous with the second quartile (Q2). Similarly, the first quartile (Q1) is the point that is equal to or greater than at least 25 per cent of the values, and the third quartile (Q3) is the point equal to or greater than 75 per cent of the samples. The interquartile range (IQR), also known as the midspread, is the distance between Q1 and Q3, and it gives the range of values covered by the middle 50 per cent of the data. However, the quartiles alone contain no information about the endpoints of the data and possible outliers. To get a better overview of a distribution and its shape, the five-number summary is used (Seltman 2003). The five-number summary consists of the minimum, Q1, the median, Q3 and the maximum. It covers the whole range of samples and allows identifying outliers. It is particularly useful when presented graphically in the form of a boxplot, which is covered further in the paper.
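All of the spread measures above fit in a few lines of Python (invented values); statistics.quantiles with n=4 returns the three cut points Q1, Q2 and Q3:

```python
import statistics

values = [4, 7, 9, 11, 12, 15, 18, 21, 25, 30]

# Range: the smallest interval containing every sample.
value_range = max(values) - min(values)

# Sample variance and standard deviation (n - 1 in the denominator).
var = statistics.variance(values)
sd = statistics.stdev(values)

# Quartiles, interquartile range and the five-number summary.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
five_number = (min(values), q1, q2, q3, max(values))
print(value_range, var, sd, iqr, five_number)
```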

4.3 Modality and Shape

The modality of a dataset refers to the number of modes identified within it. Unimodal data has a single mode and is one of the characteristics of a normal distribution. In some cases, however, the distribution curve can have multiple peaks, which correspond to multiple particularly dense values in a dataset (Figure 5). As mentioned earlier in the paper, multimodal datasets are not homogeneous and contain evident subgroups with regard to the variable.

Figure 5. Simplified unimodal vs. multimodal (bimodal) distribution (source: Internet)

Skewness is a measure of asymmetry. In reality, the distribution of values is rarely symmetric. A distribution is considered symmetric if the densities at points or intervals that are equally distanced from the central tendency are similar (Shahbaba 2012). In other cases, the curve of the distribution is left-skewed (long left tail) or right-skewed (long right tail), as opposed to the bell-shaped symmetrical distribution (see Figure 4).

Kurtosis is the degree of peakedness of a distribution curve (Figure 6). The point of reference in measuring kurtosis is the normal (Gaussian) distribution, with a kurtosis equal to 0 (mesokurtic distribution). If the tails are larger, with many points near the left and right bounds, the kurtosis is positive (leptokurtic distribution). If the tails are small and the majority of points are concentrated close to the central tendency, the kurtosis is negative (platykurtic distribution).

Figure 6. Kurtosis (source: Internet)
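Both shape measures can be estimated from a sample, for instance with scipy.stats; with fisher=True, scipy reports excess kurtosis, which is 0 for a normal distribution as in the convention above (the samples below are synthetic):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
normal = rng.normal(size=10_000)             # symmetric, mesokurtic
right_skewed = rng.exponential(size=10_000)  # long right tail

print(skew(normal), kurtosis(normal, fisher=True))              # both ~ 0
print(skew(right_skewed), kurtosis(right_skewed, fisher=True))  # both > 0
```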

4.4 Graphical Methods

Numerical attributes, just like nominal ones, can be easily visualised with the use of bar charts and pie or donut charts. However, for quantitative variables, the name histogram is used in place of bar chart. This distinction comes from the fact that histograms need the data to be sorted: whereas, most of the time, there is no natural ordering to categories, numerical values carry additional information that is only evident when the visualised data is organised in order of magnitude. Similarly to a bar chart, a histogram represents values and their frequencies in a dataset. It is common to group (bin) continuous and discrete values into intervals, as mentioned earlier in the paper. Binning smooths the chart and makes the overall patterns more apparent. Standard histograms illustrate the count of occurrences and are known as frequency histograms. It is also possible to visualise the relative frequency of a value or an interval with density histograms (the density is calculated by dividing the relative frequency by the interval width). Histograms are a straightforward method of identifying the modality and the shape of a dataset, as well as the spread and kurtosis (Figure 7).


Figure 7. Sample histograms (source: Internet)
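Frequency and density histograms differ only by normalisation; in matplotlib, for example, the density=True flag performs the division described above (synthetic data, as a sketch):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = rng.normal(loc=50, scale=10, size=1_000)

# Left: raw counts per bin; right: normalised so the bar areas sum to 1.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(data, bins=20)
ax1.set_title("Frequency histogram")
ax2.hist(data, bins=20, density=True)
ax2.set_title("Density histogram")
plt.show()
```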

Boxplots are comprehensive tools that present the central tendency and spread of a distribution for numeric attributes, based on the five-number summary. Figure 8 illustrates all the elements of the statistical summary that can be visualised on a boxplot for a complete view of the data. The rectangular box represents the middle 50 per cent of the values, and therefore its height is the IQR. The solid line inside it marks the median, and the plus sign marks the mean. An unequally split box, or differing values of the median and the mean, indicates a skewed distribution. The dashed lines on both sides of the box are known as whiskers. The top whisker extends to the highest value or Q3 + 1.5×IQR, whichever it reaches first. The bottom whisker extends to the lowest value or Q1 − 1.5×IQR, whichever it reaches first. This convention comes from the assumption that points which fall beyond 1.5 box-lengths from the upper or lower edge of the box are possible outliers. Those values are denoted as circles on the graph. Points that fall beyond 3×IQR are perceived as extreme outliers and are plotted with a different symbol (Seltman 2003).


Figure 8. Sample boxplot (source: Internet)
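matplotlib's boxplot applies the same 1.5×IQR whisker rule (its default); a minimal sketch with two artificial outliers appended to a synthetic sample:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(50, 10, 200), [120, 130]])  # two outliers

# whis=1.5 is the Q1 - 1.5*IQR / Q3 + 1.5*IQR rule described above;
# showmeans marks the mean in addition to the median line.
plt.boxplot(data, whis=1.5, showmeans=True)
plt.ylabel("Value")
plt.show()
```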

5. Summary - Characteristics of Good Measures

Since an average is the principal measure illustrating the centre of the data, it needs to be representative. Firstly, it should be calculated based on all samples in a dataset and at the same time be robust to extreme values. These conditions are necessary because the central tendency, especially the mean, is a measure broadly used in further analysis and data transformation. As explained in Section 3 and Section 4, the appropriate centre measure depends on the characteristics of the data. Whereas nominal attributes can only be described by the mode, numerical variables allow calculating the mean, median and mode. In choosing the most appropriate indicator of central tendency for a given set of quantitative data, the nature of the distribution of values needs to be taken into consideration (Cooper & Weeks 1983).

As mentioned before, the standard deviation is a measure based on the mean and is therefore sensitive to extreme values. The interquartile range is a more robust measure of spread, as increasing the distance of the endpoints on either side of the median does not affect the IQR. In contrast, the overall range is directly affected by outliers. Nonetheless, the range is the easiest way of detecting extreme values and errors, provided there is some knowledge of what the reasonable scope of the values should be.
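A single outlier makes the contrast visible (invented values): the standard deviation and the range jump, while the interquartile range moves only slightly.

```python
import statistics

clean = [10, 12, 13, 14, 15, 16, 18]
with_outlier = clean + [100]

for data in (clean, with_outlier):
    q1, _, q3 = statistics.quantiles(data, n=4)
    print(f"stdev={statistics.stdev(data):6.2f}  "
          f"IQR={q3 - q1:5.2f}  range={max(data) - min(data)}")
# stdev: 2.65 -> 30.50; range: 8 -> 90; IQR: 4.00 -> 5.25
```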


Graphical methods are widely popular in exploring data. They allow easy and intuitive retrieval of information and the detection of possible data issues which would be difficult to obtain in any other way. Many graphical methods which summarise the information for analysis can also provide insight into the data for an audience. It is therefore important to use them with understanding and with regard to the business objective. Visualising data in a clear and consistent way fosters the process of analysis, especially for large datasets that cannot simply be comprehended with mathematical measures. Once again, it is worth noting that statistical summary and visualisation techniques complement each other in exploring a single variable.

