Tools for Multivariate Data Analysis
Universität Koblenz-Landau

Session 7: Introduction to multivariate analysis, ordination and PCA

1. Why should you favour multivariate approaches for multivariate data?
- Not all research questions can be answered with univariate statistical methods, e.g. "What are the most important environmental variables determining community composition?"
  - In univariate analyses, only the relationship between a single species and the environmental variables can be examined.
  - Multivariate analyses allow the simultaneous analysis of how environmental variables act on a whole community of organisms.
- Multivariate methods allow dimension reduction and visualisation of multidimensional data, e.g. ordination, cluster dendrograms.
- Joint (multivariate) analysis can reduce noise and increase the power of statistical tests.

2. Outline the differences to the univariate case when diagnosing multivariate outliers and normality.
- Univariate case:
  - Outlier = point outside the box (boxplot); each variable is considered individually.
  - Normality is checked with a Q-Q plot per variable.
- Multivariate case:
  - Outlier = point outside the ellipse; the variables are considered jointly (what counts as an outlier is assessed with respect to the joint distribution).
  - Normality is checked with a Q-Q plot of the sample Mahalanobis distances (to the centroid) against the theoretical quantiles of the chi-square distribution.

Outlier checking: use the joint multivariate distribution of the variables, i.e. look for points that are extreme with respect to the multivariate centre. Points outside the box flag outliers for a single variable; points outside the ellipse (based on the Mahalanobis distance) flag outliers for the combination of variables.

Distributional assumption checking: evaluate the multivariate normal distribution (e.g. a pair of univariate normal variables forms a bivariate normal distribution, which has different properties than the univariate normal distribution). A visual check of multivariate normality plots the ordered sample Mahalanobis distances (to the centroid of the data) against the quantiles of the chi-square distribution.
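A minimal sketch of this visual check in base R; the iris measurements standing in for environmental data and the 0.975 cut-off are illustrative assumptions, not part of the course material:

env <- iris[, 1:4]                        # stand-in numeric data set
d2  <- mahalanobis(env, center = colMeans(env), cov = cov(env))  # squared distances to the centroid
# Q-Q plot of ordered distances against chi-square quantiles (df = number of variables)
qqplot(qchisq(ppoints(length(d2)), df = ncol(env)), d2,
       xlab = "Chi-square quantiles", ylab = "Ordered squared Mahalanobis distances")
abline(0, 1)                              # points close to this line support multivariate normality
# candidate multivariate outliers: distances beyond the 0.975 chi-square quantile
which(d2 > qchisq(0.975, df = ncol(env)))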

3. Outline the aim of ordination.
- Extraction of new axes from high-dimensional data that sequentially maximise the variance
- Dimension reduction (e.g. omission of axes that capture a low amount of variance)
- Aggregation of variables into gradients
- Graphical representation in lower dimensionality

The goal of ordination is to generate a reduced number of new synthetic axes that are used to display the distribution of objects along the main gradients in the dataset.

4. Distinguish constrained and unconstrained ordination.
Unconstrained ordination: extraction of axes without consideration of variables outside the data set. An example of unconstrained ordination is the agglomeration of similar data points.
Constrained ordination: extraction of the information that can be explained by the variables of a second data set. An example is explaining the response of organisms to environmental variables.

5. Which criteria should guide the selection of an ordination method?
The choice of ordination method depends on 1) the type of data, 2) the similarity/distance measure you want to or can use, and 3) the expected result:
- Research goal
- Assumed shape of the relationship: linear, unimodal (one peak, up and down), or any shape (e.g. two peaks or a fluctuating curve)
- Input data: raw data or a distance matrix
- Ordination output
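A minimal sketch of the constrained/unconstrained distinction with the vegan package; the dune example data shipped with vegan and the chosen explanatory variables (A1, Management) are illustrative assumptions:

library(vegan)
data(dune)       # community (species abundance) data
data(dune.env)   # environmental variables for the same sites

pca_mod <- rda(dune)                                        # unconstrained ordination (PCA)
rda_mod <- rda(dune ~ A1 + Management, data = dune.env)     # constrained ordination (RDA)

summary(pca_mod)   # axes extracted from the community data alone
summary(rda_mod)   # constrained axes explained by the second data set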

6. Explain the variance-covariance matrix.
The variance-covariance matrix is a symmetric matrix obtained by multiplying the centred data matrix with its transpose (and dividing by n - 1). The diagonal elements contain the variances of the individual variables; the off-diagonal elements contain the covariances between each pair of variables. If the variables are standardised to a standard deviation of 1, the diagonal consists of ones and the matrix equals the correlation matrix.

7. What are eigenvalues and eigenvectors?
Eigenvectors are vectors that do not get knocked off their span by a linear transformation; they are only stretched along their own span. The eigenvalue is the factor by which the corresponding eigenvector is stretched under the linear transformation. In PCA, the eigenvectors of the covariance (or correlation) matrix define the principal components as linear combinations of the original variables, and the eigenvalue of a PC tells us how much of the total variation in the data is explained by that PC. For example, if the total variation of the data is 10 units and the eigenvalue of PC1 is 8 units, then PC1 explains 80% (8/10) of the variation; with two variables, the remaining 20% is explained by PC2.

8. How do eigenvalues relate to the variance of a PC?
They are the explained variance per PC: each eigenvalue gives the amount of variance captured by its axis. The largest eigenvalue (and its corresponding eigenvector) explains the highest share of the total variance. If the variables have been standardised to a standard deviation of 1, the total variance (the sum of the eigenvalues) equals the number of variables, i.e. the number of PCs in the PCA.
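A minimal sketch in base R of this relationship: for standardised data the eigenvalues of the covariance (here: correlation) matrix are the variances of the PCs. The iris data are an illustrative assumption:

X   <- scale(iris[, 1:4])               # centre and standardise the variables
eig <- eigen(cov(X))                    # eigen-decomposition of the covariance matrix
eig$values                              # eigenvalue = variance of each PC; sum = number of variables
eig$values / sum(eig$values)            # proportion of total variance per PC

pca <- prcomp(X)
pca$sdev^2                              # identical to eig$values (up to numerical precision)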

9. Outline criteria to determine the optimal number of PCs.
- Sum criterion
- Broken-stick criterion
- Scree plot
- Cross-validation (k-fold) criterion
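A minimal sketch of two of these criteria (scree plot and a broken-stick comparison) in base R; the example data and the hand-rolled broken-stick formula follow standard conventions and are not taken from the course slides:

X   <- scale(iris[, 1:4])
pca <- prcomp(X)
screeplot(pca, type = "lines")                      # look for the "elbow"

p        <- ncol(X)
observed <- pca$sdev^2 / sum(pca$sdev^2)            # observed proportion of variance per PC
bstick   <- rev(cumsum(1 / (p:1))) / p              # broken-stick expectation per PC
data.frame(PC = 1:p, observed = round(observed, 3), broken_stick = round(bstick, 3))
# retain the PCs whose observed proportion exceeds the broken-stick expectation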

10. What is sparse PCA and how does it influence the evaluation of descriptor contribution to PCs?
Sparse PCA is similar to the Lasso method in the sense that both add a penalty term (weighted by a factor lambda), here on the loadings of the PCs. It typically results in a lower number of variables correlating with each axis, and typically in lower eigenvalues for the first PC. It is not possible to know exactly how much variance is explained in a sparse PCA, because it is unknown to which extent the penalty term reduces the explained variance. A sparse PCA would only explain 100% of the original variance if no variable were redundant, i.e. if all variables of the original data set were uncorrelated (completely linearly independent, all orthogonal).

11. Explain biplots with respect to (a) correlation between variables and (b) relation between sites/species and variables.
(a) Correlation between variables: the angle between two variable arrows indicates their correlation: 0° = positive correlation, 90° = no correlation, 180° = negative correlation.
(b) Relation between sites/species and variables: the strength of the relationship between a sample and a variable is given by the distance from the origin to the point where the perpendicular projection of the sample meets the variable arrow (or its extension).
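A minimal sketch of a biplot in base R (the iris data are an illustrative assumption); in this correlation-type display the angles between the variable arrows approximate the correlations between the variables:

pca <- prcomp(iris[, 1:4], scale. = TRUE)
biplot(pca)                      # sites as points, variables as arrows
round(cor(iris[, 1:4]), 2)       # compare the arrow angles with the correlation matrix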

12. How does scaling influence the interpretation of a biplot? What does the argument scale = TRUE in rda() {vegan} mean?
scale = TRUE applies an operation to the original variables: the data are treated to yield a mean of zero (centring) and a standard deviation of 1, so that all variables enter the analysis with equal weight.

There is another "scaling" argument in the plotting functions (e.g. biplot()) that emphasises different relationships between the elements of the biplot:
1. Emphasises the relationships between the objects (distance biplot).
2. Emphasises the relationships between the descriptors (correlation biplot).
3. A compromise between 1 and 2.

13. Which objects from a PCA would be extracted as non-collinear predictors for a multiple regression analysis?
The "scores": the PC scores result from multiplying the (centred) original data with the eigenvectors. These extracted (unscaled) PC scores are used as descriptors in a multiple regression analysis.
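A minimal sketch of this workflow in base R; the mtcars descriptors and the response mpg are illustrative assumptions:

X      <- scale(mtcars[, c("disp", "hp", "wt", "drat")])   # correlated descriptors
pca    <- prcomp(X)
scores <- as.data.frame(pca$x[, 1:2])                      # PC scores (PC1, PC2) of the objects
fit    <- lm(mtcars$mpg ~ PC1 + PC2, data = scores)        # PCs as non-collinear predictors
summary(fit)
cor(scores$PC1, scores$PC2)                                # essentially zero: orthogonal predictors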

Session 8: RDA, similarity measures and NMDS

1. How many constrained axes has an RDA and how are they related to the descriptors?
In an RDA, a PCA step produces canonical eigenvectors and eigenvalues, together with a matrix Z containing the canonical (constrained) axes. The constrained axes are linear combinations of the explanatory variables in X. The number of constrained axes equals the number of explanatory variables in the second data set.

2. How does scaling influence the interpretation of a triplot?
The results of an RDA are presented as a triplot, which is essentially a superimposition of two biplots. There are two types of triplots:

Triplot with scaling 1 (distance based)
- Eigenvectors are scaled to unit length.
- Preserves the Euclidean distances between points.
- Angles between response variables have no meaning.
- Angles between response and explanatory variables reflect their correlation.
- Enlarging or shrinking the plot does not change the interpretation of the results.

Triplot with scaling 2 (covariance based)
- Preserves the covariances of the fitted Y values.
- Angles between response and explanatory variables, and between the response variables themselves, show the corresponding correlations.
- Enlarging or shrinking the plot does change the interpretation of the results.

3. Which association is measured with similarity measures?
The relationship between objects (sites).

4. Outline the calculation of the Bray-Curtis and the Jaccard coefficient.
Jaccard coefficient
- Used for binary (presence/absence) data; ignores joint absences (d).
- With a = number of species present at both sites and b, c = number of species present at only one of the two sites: J = a / (a + b + c).
Bray-Curtis coefficient
- Used for abundance data; range 0-1 (if all xk ≥ 0).
- Data transformation is often required to reduce the weight of dominant taxa.
- BC_ij = Σk |x_ik − x_jk| / Σk (x_ik + x_jk), where x_ik and x_jk are the abundances of taxon k at sites i and j.
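A minimal sketch with vegan (the dune example data are an illustrative assumption): Bray-Curtis on abundances, Jaccard on presence/absence data:

library(vegan)
data(dune)                                                # sites x taxa abundance matrix
bray <- vegdist(dune, method = "bray")                    # Bray-Curtis dissimilarity, range 0-1
jacc <- vegdist(dune, method = "jaccard", binary = TRUE)  # Jaccard on presence/absence, joint absences ignored
round(as.matrix(bray)[1:3, 1:3], 2)                       # pairwise dissimilarities of the first sites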

5. Explain the double-zero problem.
Joint absences can arise for completely arbitrary reasons, so the mutual absence of a species at two sites does not imply that those sites are similar. Many different factors can cause the absence of a species.

6. What is the species abundance paradox?
When Euclidean distances are calculated between sites based on species abundances, two sites that share no species can end up with a smaller distance than two sites that do share species. This would mean that sites sharing no species are "more similar" than sites that do, which is not meaningful. Euclidean distance is therefore problematic for ecological (community) data.
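A minimal numerical sketch of the paradox with three hypothetical sites and three species; the abundance values are made up for illustration:

sites <- rbind(site1 = c(0, 1, 1),
               site2 = c(1, 0, 0),    # shares no species with site1
               site3 = c(0, 4, 8))    # shares two species with site1
dist(sites, method = "euclidean")
# site1-site2 = 1.73 (smallest distance despite no shared species),
# site1-site3 = 7.62, site2-site3 = 9.00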

7. What are the main differences between NMDS and PCA?
- NMDS (non-metric multidimensional scaling) only preserves the rank order of the distances.
- PCA (principal component analysis) partitions the variance among the axes.

8. Which three matrices are computed during NMDS?
- Distance matrix from the raw data
- Distance matrix from the initial configuration in lower-dimensional space
- Disparity matrix

9. Outline the major elements of the algorithm used to compute the NMDS.
1. Determine the distance matrix from the raw data
2. Choose an initial configuration (often based on MDS/PCoA) in lower-dimensional space
3. Determine the distance matrix for this configuration
4. Determine disparities using monotone regression and the pool adjacent violators (PAV) algorithm
5. Find a new configuration with higher similarity to the initial distance matrix
6. Go to 3 (if the fit does not improve over many iterations → 7)
7. Evaluate the goodness of fit of the final configuration
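A minimal sketch with vegan: an NMDS of the dune example data (an illustrative assumption) using Bray-Curtis distances and two dimensions:

library(vegan)
data(dune)
nmds <- metaMDS(dune, distance = "bray", k = 2, trymax = 50)  # several random starts
nmds$stress              # goodness of fit (stress) of the final configuration
plot(nmds, type = "t")   # ordination plot with site and species labels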

10. Discuss limitations of NMDS.
- Results depend on the initial configuration.
- Loss of information due to the rank-order approach → information on absolute distances is lost.
- No partitioning of variance among axes.
- Interpretation of the results becomes difficult if more than 2 or 3 dimensions are needed (i.e. to yield a low STRESS1 value).
- A significant fit of environmental variables to the ordered distances is more difficult to interpret than for unconstrained variance-based methods.

Session 9: Unsupervised classification: Cluster analysis

1. What is the aim of cluster analysis?
- Identification of groups
- Visualising the similarity/distance of objects
- Data aggregation and dimensionality reduction (e.g. to plot a dendrogram)
- Identification of outliers
- Noise reduction

2. What is the difference between hierarchical and non-hierarchical cluster analysis?
Hierarchical clustering is a step-wise (sequential) and deterministic method that ends with one single cluster hierarchy. It can be agglomerative (starting with one cluster per object and successively merging clusters) or divisive (starting with one cluster and splitting it until the number of clusters equals the number of objects). Groups are identified by merging the groups/objects with the smallest distance.
Non-hierarchical cluster analysis clusters all objects simultaneously, starting from a random initial group assignment, and assigns the objects to a pre-defined number of clusters k.

3. Outline the algorithms for hierarchical clustering and k-means.
Hierarchical clustering
1. Search for the shortest distance (e.g. Euclidean or Bray-Curtis) between pairs of objects and merge them into a cluster.
2. Re-calculate the distances, including the newly formed cluster in the distance matrix, and repeat step 1.
3. The algorithm stops when all objects are merged into one cluster.
k-means clustering
1. Partition the objects into k groups.
2. Move objects and calculate the change in the clustering criterion.
3. Choose the best solution.
4. Repeat steps 2-3 until the cluster criterion does not improve any further.
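A minimal sketch of both approaches in base R; the iris measurements and k = 3 are illustrative assumptions:

X <- scale(iris[, 1:4])                     # standardised example data
d <- dist(X, method = "euclidean")          # distance matrix between objects

hc <- hclust(d, method = "average")         # agglomerative hierarchical clustering
plot(hc)                                    # dendrogram
groups_hc <- cutree(hc, k = 3)              # cut the tree into 3 groups

km <- kmeans(X, centers = 3, nstart = 25)   # non-hierarchical: k is pre-defined
table(groups_hc, km$cluster)                # compare the two groupings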

4. Describe the calculation of distances between clusters for single, average and complete linkage. How does the choice of the method influence the interpretation of results?
Single linkage
- Adds objects to clusters even if this leads to shorter distances between clusters than between objects in the raw data. It is employed to find discontinuities in the data.
- Space-contracting.
Complete linkage
- Minimises the distances of the objects within the clusters, i.e. constructs clusters that are as homogeneous as possible. It can be used to combine objects with very similar characteristics.
- Space-dilating.
Average linkage
- The original distances (as given in the distance matrix) are largely preserved → high correlation between the original distances and the distances after clustering, which are given in the so-called cophenetic matrix. This approach should be employed when the major aim is to preserve the characteristics of the input data.
- Space-conserving.

5. Explain the analogy between k-means and ANOVA.
The criterion for the k-means algorithm to converge is the minimisation of the within-cluster sum of squares (SSW); both methods can also rely on Euclidean distances. This is analogous to an "ANOVA in reverse": the significance test in ANOVA evaluates the between-group variability against the within-group variability for the hypothesis that the group means differ, whereas in k-means clustering the algorithm moves objects (cases) in and out of groups (clusters) so as to obtain the most significant ANOVA-like separation.

6. List cluster validity indices. What is the difference between external and internal validation?
- Calinski-Harabasz index (internal CVI)
- Gap index (internal CVI)
- Silhouette width S (internal CVI; see the sketch at the end of this section)
- Cluster stability via bootstrapping (internal CVI)
- Rand index (external CVI)
Internal CVIs: used when there is no a priori knowledge about the original grouping; they compare distinct cluster solutions.
External CVIs: used when there is a priori knowledge about the original grouping; they compare the resulting clustering structure to this prior knowledge (e.g. about how many clusters to expect).

7. Discuss limitations of cluster analysis.
- Cluster analysis remains partly subjective (lack of formalisation: many choices have to be made, and they influence the outcome in different ways). It is still a good tool for exploratory data analysis.
- It has a tendency to detect spherical clusters; second-level spherical clusters are merged into a more inclusive one (self-containing spherical structures)...
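A minimal sketch of internal validation with the average silhouette width (cluster package), referenced in the CVI list above; the data set and the range of k are illustrative assumptions:

library(cluster)
X <- scale(iris[, 1:4])
d <- dist(X)                                  # Euclidean distances
for (k in 2:5) {
  km  <- kmeans(X, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  cat("k =", k, " mean silhouette width =", round(mean(sil[, "sil_width"]), 3), "\n")
}
# a higher mean silhouette width indicates a better-separated cluster solution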

