Big Data Machine learning PDF

Title	Big Data Machine learning
Author	Jac Rees
Course	Big Data and Machine Learning
Institution	Swansea University
Pages	38
File Size	1.2 MB
File Type	PDF
Total Downloads	362
Total Views	786


Description

Big Data & Machine learning https://www.studocu.com/en-gb/document/swansea-university/big-data-andmachine-learning/past-exams/final-2110-january-2020-questions-andanswers/7816442/view

Intro and Fundamentals

Can you explain the Big Data concept from the machine learning perspective?
- Extreme amounts of data
- Normal methods of processing (analysis, storage, querying, etc.) are difficult to apply
- Can be analysed to reveal information such as trends and associations
- The term is evolving and describes any voluminous amount of structured, semi-structured and unstructured data

What are the key characteristics of Big Data?
- Volume: size of the data which needs to be analysed → e.g. all credit card transactions in Europe
- Velocity: speed at which the data is generated → e.g. instant messaging
- Variety: data is obtained from a variety of sources
- Veracity: quality of the data which is being analysed

What is Machine Learning and what are the different types of machine learning?
1. Collect the data (supervised or unsupervised)
2. Identify the features of the data
3. Choose the ML method/model (KMeans, LDA, PCA, NN, etc.)
4. Predict the model which represents the data the best
5. Compare the model with other methods to ensure it was the best choice
6. Check the model against test data

What is Supervised Learning?
- Description: training using data with predetermined labels
- Advantages: usually gains better results than other methods
- Disadvantages: can be costly and time consuming
- Types: classification and categorisation (discrete data); regression (continuous data)
- Examples: email spam detection, digit recognition, mask identification

What is unsupervised learning?
- Description: training on data without labels
- Advantages: easier, since labels are not always available
- Disadvantages: less effective in its results; not much point in using it if labels are available
- Types: clustering (discrete); dimensionality reduction (continuous)
- Examples: segmenting customers to better adjust products and offerings, anomaly detection, social network analysis

Define 2 Other Types Of Machine Learning

Semi-Supervised Learning
- Description: the training data contains a few labels

Reinforcement Learning
- Description: rewards from a sequence of actions
- Example: in chess there is only one "supervised" signal, at the end of the game, yet a move has to be made at every step
- Deals with "credit assignment"

Supervised and Unsupervised Methods

Clustering

What is the key concept of clustering?
- Unsupervised: no labels are given
- Groups similar data points together
- Data instances that are (very) different or far away from each other are placed in different clusters

Why is clustering useful for analysing big data?
- Gain insights into the structure of the data

What are the common clustering techniques?
- KMeans
- Fuzzy C-Means
- Gaussian Mixture Model

What is K-Means Clustering?
- KMeans is a partitional clustering algorithm
- The number of clusters is pre-set (the value of K)
- Each cluster is represented by the centre of the cluster
- Each data point is assigned to the nearest cluster centre, or centroid
- Iterative process → starts with a random initialisation
- Aims to minimise the Sum Of Squared Errors (SSE)

What are the key steps in K-means clustering?
1. Choose an initial number of clusters
2. Randomly initiate the centroids
3. Calculate the Sum Of Squared Errors (SSE)
4. Assign each data point to its nearest centroid
5. Move each centroid to the mean position of all data points assigned to it
6. Recalculate the SSE
7. If the change in SSE > threshold and iteration < max_iterations then repeat from step 4
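The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, its parameters and the toy data are assumptions made for the example, and the initial centroids are drawn from the data points themselves.

```python
import numpy as np

def kmeans(X, k, max_iterations=100, threshold=1e-6, seed=0):
    """Plain K-Means on an (n_samples, n_features) array (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    previous_sse = np.inf
    for _ in range(max_iterations):
        # Step 4: assign each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 5: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
        # Step 6: recompute the SSE against the updated centroids.
        sse = float(((X - centroids[labels]) ** 2).sum())
        # Step 7: stop once the SSE changes by no more than the threshold.
        if previous_sse - sse <= threshold:
            break
        previous_sse = sse
    return centroids, labels, sse

# Toy data (hypothetical): two well-separated blobs of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centroids, labels, sse = kmeans(X, k=2)
```

On data like this, `labels` should split the points into the two blobs; on harder data the result depends on the random initialisation, which is why the algorithm is often run several times.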

What is Gaussian Mixture Modelling?
- A probabilistic, model-based approach to data clustering
- A weighted sum of multiple single Gaussian functions
- Each cluster is a generative model
- Represents normally distributed subpopulations within an overall population
- The model is described by several parameters
- Calculates the probability that a data point belongs to a cluster (soft assignment)

How can GMM be used for data clustering?
- Identifies the probability that a data point belongs to a certain distribution (cluster)
- Uses the mean and variance of each distribution

What are the differences between K-means and GMM clustering?
- GMM uses a probabilistic model
- GMM is a soft assignment (the probability that a point belongs, i.e. the strength of the association); KMeans is a hard assignment (belongs or doesn't)
- KMeans doesn't account for variance within the set

Can you identify the GMM parameters and their roles?
- k: the number of Gaussian functions (and therefore clusters)
- μ: the mean of each Gaussian function
- σ: the standard deviation of each Gaussian function
- P: the mixing coefficient of each Gaussian function, also known as the prior probability; all P's in a model add up to 1

Define Prior and Posterior Probability
- Prior probability: probability distribution before the evidence has been taken into account
- Posterior probability: probability distribution after the evidence has been taken into account

Can you describe the steps in GMM clustering?
1. (Optional) Use KMeans to find the initial state
2. Determine initial values:
   - K is given
   - μ is the centroid of each KMeans cluster
   - σ can be calculated from the spread of each cluster's points about its mean (a smaller σ equates to a higher peak, and vice versa)
   - P is the proportion of data points assigned to each of the Gaussian functions
3. Calculate the posterior probability of each point (the probability that a given Gaussian is the one to which the point belongs)
4. Update the variables using the new posterior probabilities
5. Repeat steps 3 & 4 until the values converge

What impact do outliers have on clustering results?
- Can skew the results (inaccurate clustering)
- KMeans is particularly sensitive to outliers

What steps can be taken to minimise the outlier impact?
- Delete the suspected outlier and run the algorithm again
- If there is a big difference in the result then you have found an outlier

Usages of GMM

Background removal in videos
- Every pixel in the frame is evaluated
- If the probability of a pixel is below the threshold, that pixel is kept for display
- Otherwise the pixel is set as background (e.g. set to green)

Image segmentation
- Build a colour histogram using all the pixels
- Model the histogram using GMM
- Classify each pixel based on the posterior probability:

$p(j \mid x) = \frac{p(x \mid j)\,P(j)}{p(x)}$
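As a rough sketch of how the posterior formula drives GMM clustering, the code below alternates between computing p(j|x) for every point and updating μ, σ and P from those posteriors. It is a one-dimensional illustration under stated assumptions: the names `gaussian_pdf` and `fit_gmm_1d`, the hand-picked starting values (standing in for a KMeans initialisation) and the fixed iteration count (standing in for a convergence check) are all choices made for this example.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def fit_gmm_1d(x, mu, sigma, P, n_iterations=100):
    """Fit a 1-D GMM; mu, sigma and P are initial guesses of length k."""
    mu, sigma, P = (np.asarray(a, dtype=float).copy() for a in (mu, sigma, P))
    for _ in range(n_iterations):
        # Posterior p(j|x) = p(x|j) P(j) / p(x) for every point and Gaussian.
        weighted = P * gaussian_pdf(x[:, None], mu, sigma)   # shape (n, k)
        posterior = weighted / weighted.sum(axis=1, keepdims=True)
        # Update the parameters, using the posteriors as soft counts.
        counts = posterior.sum(axis=0)
        mu = (posterior * x[:, None]).sum(axis=0) / counts
        sigma = np.sqrt((posterior * (x[:, None] - mu) ** 2).sum(axis=0) / counts)
        P = counts / len(x)
    return mu, sigma, P, posterior

# Toy data (hypothetical): two normally distributed subpopulations.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(6.0, 1.0, 300)])
mu, sigma, P, posterior = fit_gmm_1d(x, mu=[1.0, 4.0], sigma=[1.0, 1.0], P=[0.5, 0.5])
```

Each row of `posterior` holds the soft assignment of one point; taking the column with the largest value turns it into a hard cluster label.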

Linear Regression

What is linear regression?
- A form of supervised learning; more concretely, regression
- Predict the value of $y_{n+1}$ given a new point $x_{n+1}$, after obtaining the parameters $w_i$ from training on the examples $(x_i, y_i)$, $i = 1, ..., n$
- Used to show or predict the relationship between two variables
- Each observation consists of two values (independent and dependent)
- A straight line approximates the relationship between the dependent and independent variables

What is the fundamental concept of the independent and dependent variables in linear regression?
- Independent variable (x): the factors used to predict the dependent variable
- Dependent variable (y): the factor being predicted
- Estimate $\hat{y}$ using a linear function of the data x: $\hat{y}_{n+1} = w^T x_{n+1}$, where $w = (w_0, w_1, w_2, ...)^T$ and $x_{n+1} = (1, x_{n+1,1}, x_{n+1,2}, ...)^T$
- A one is added as the first component of x so that the linear equation can be written as a dot product between two vectors

What criteria does linear regression use to evaluate regression results?
- The accuracy of the hypothesis is measured using a cost function → takes the average over all the training examples
- The cost function (Least Mean Squares) takes the difference between the predicted value and the actual value and squares the result; the aim is to minimise the Sum Of Squared Errors
- To find the LMS solution, employ gradient descent

What is the limitation of the criteria?
- Sensitive to outliers
- Tendency to overfit the data
- Struggles with input spaces involving higher dimensions
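A small sketch may make the least-squares cost and gradient descent more concrete. The function name `fit_linear_regression`, the learning rate, the iteration count and the noise-free toy data are assumptions chosen for illustration, not values taken from the course.

```python
import numpy as np

def fit_linear_regression(X, y, learning_rate=0.1, n_iterations=2000):
    """Minimise the mean squared error with batch gradient descent."""
    # Prepend a column of ones so that w[0] acts as the intercept w0.
    X1 = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(X1.shape[1])
    for _ in range(n_iterations):
        residuals = X1 @ w - y                       # predicted minus actual
        gradient = 2.0 * X1.T @ residuals / len(y)   # gradient of the mean squared error
        w -= learning_rate * gradient
    return w

# Toy data (hypothetical): noise-free samples of y = 1 + 2x.
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x
w = fit_linear_regression(x.reshape(-1, 1), y)

# Prediction for a new point x = 0.5, written as the dot product w^T x.
prediction = np.array([1.0, 0.5]) @ w
```

Because the cost squares each residual, a single extreme point can dominate the gradient, which is one way to see the sensitivity to outliers noted above.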

Describe Nonlinear Fitting
- Powerful function approximators
- Predictions are still linear in the coefficients w:

$y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = \sum_{j=0}^{M} w_j x^j$

- Same maths as multidimensional linear regression
- Instead of extra dimensions, extra powers of x

Describe various Linear Basis Function Models
- Polynomial basis: $\phi_j(x) = x^j$
  Global → a small change in x affects all the basis functions
- Gaussian basis: $\phi_j(x) = \exp\{-\frac{(x - \mu_j)^2}{2s^2}\}$
  Local → a small change in x only affects nearby basis functions
  $\mu_j$, s: location, scale (width)

Describe Quadratic Regularisation
- A method to reduce overfitting
- A penalty term is added to the end of the cost function
- λ: the regularisation coefficient
- High λ: reduces the variation of the values
- Low λ: less effect on the variation

Define Bias, Variance and Expected Loss
- Bias: the average difference between the predicted and real values
- Variance: the variability of the values predicted
- Expected loss: a formula which uses bias and variance; a low expected loss is the aim

$\text{bias}^2 + \text{variance} + \text{noise} = \text{expected loss}$

- High bias = too much regularisation
- High variance = too little regularisation
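To illustrate the effect of λ, the sketch below fits a degree-9 polynomial to ten noisy points with a quadratic penalty. The exact penalty form (λ times the squared norm of all coefficients, including $w_0$), the closed-form solve, the function names and the toy data are assumptions made for this example.

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Columns are x**0, x**1, ..., x**M (the polynomial basis functions)."""
    return np.vander(x, M + 1, increasing=True)

def fit_ridge(Phi, y, lam):
    """Closed-form minimiser of ||Phi w - y||^2 + lam * ||w||^2."""
    n_basis = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_basis), Phi.T @ y)

# Toy data (hypothetical): ten noisy samples of a sine wave.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=x.shape)

Phi = polynomial_design_matrix(x, M=9)
w_low = fit_ridge(Phi, y, lam=1e-8)   # almost no regularisation: high variance
w_high = fit_ridge(Phi, y, lam=1.0)   # strong regularisation: higher bias

norm_low = np.linalg.norm(w_low)
norm_high = np.linalg.norm(w_high)
```

With the small λ the coefficients can grow large in order to chase the noise, whereas the large λ shrinks them towards zero, trading variance for bias.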

PCA & LDA

Why is dimensionality reduction useful?
- Datasets may have millions of dimensions, which can affect:
  - Redundant features (e.g. not all words are useful)
  - Efficiency of algorithms
  - Cost of storage
  - Data retrieval
- Dimensionality reduction reduces the number of dimensions whilst maintaining the meaningfulness of the data
- Finds a low-dimensional but useful representation
- Discovers the intrinsic dimensionality of the data

What is the fundamental concept of PCA?
- A linear method which is used to reduce the dimensionality of data
- Reduces the dimensionality of large numbers of interrelated values whilst retaining vital information
- Transforms them to a new set of variables (principal components) which are uncorrelated, yet ordered so that the first few retain most of the variation present in the original variables
- Transforms the coordinate system so that the axes are aligned with the directions of variance, with the variance decreasing from one axis to the next

Describe the steps for PCA
1. Standardise the data
   - Subtract the mean of the feature column from the data values (with large outliers the median is sometimes used), where n is the number of observations: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
   - This results in zero-centred values
2. Compute the covariance matrix of the features from the dataset
3. Perform eigendecomposition on the covariance matrix
4. Order the eigenvectors in decreasing order based on the magnitude of their corresponding eigenvalues
5. Determine k → the number of top principal components to select
   - To determine the number of principal components to select for dimensionality reduction, plot the cumulative sum of the normalised eigenvalues $\lambda_j / \sum_{i=1}^{d} \lambda_i$
6. Perform Singular Value Decomposition
   - U holds the left singular vectors
   - V* is the conjugate transpose of the matrix of right singular vectors
   - S holds the singular values
   - The squared singular values are proportional to the eigenvalues calculated from the eigendecomposition
7. Compute the new k-dimensional feature space
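The PCA steps can be sketched with NumPy as follows. This is an illustrative outline only: the function name `pca`, the toy data and the choice to zero-centre without rescaling the features are assumptions made for the example.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k principal components."""
    X_centred = X - X.mean(axis=0)                          # zero-centre each feature
    covariance = np.cov(X_centred, rowvar=False)            # features are columns
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # returned in ascending order
    order = np.argsort(eigenvalues)[::-1]                   # re-sort to descending
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    explained = np.cumsum(eigenvalues) / eigenvalues.sum()  # cumulative share of variance
    projected = X_centred @ eigenvectors[:, :k]             # new k-dimensional feature space
    return projected, eigenvalues, explained

# Toy data (hypothetical): three features, the third almost a copy of the first.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, rng.normal(size=200), a + 0.01 * rng.normal(size=200)])
projected, eigenvalues, explained = pca(X, k=2)

# Cross-check with the SVD: squared singular values / (n - 1) match the eigenvalues.
singular_values = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
eigenvalues_from_svd = singular_values ** 2 / (len(X) - 1)
```

Here the first two entries of `explained` already account for almost all of the variance, which is the kind of evidence used to choose k.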

What are variance and covariance?

Variance
- A measure of the variability that utilises all of the data (how far the data points are spread out)
- It is the average of the squared differences between the data values and their mean:

$\mathrm{var}(X) = \sigma^2 = E[(X - \bar{X})^2]$

where $E[\cdot]$ denotes the expected value (i.e. the mean) and $\bar{X}$ is the mean of X.
- For zero-centred values, the variance simplifies to:

$\mathrm{var}(X) = \sigma^2 = E[X^2]$

Standard deviation
- The positive square root of the variance
- Measured in the same unit as the data, and as such is more easily interpreted than the variance:

$\sigma(X) = \sqrt{\mathrm{var}(X)}$

Covariance
- A measure of the joint variability of two random variables
- Describes how much two features vary together about their means
- The covariance between two (random) variables $X_1$ and $X_2$ is defined as:

$\mathrm{Cov}(X_1, X_2) = E[(X_1 - E(X_1))(X_2 - E(X_2))]$

- For zero-centred values, the covariance simplifies to:

$\mathrm{Cov}(X_1, X_2) = E(X_1 X_2)$
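The zero-centred forms of these definitions are easy to check numerically. In the sketch below the variable names and the toy data are hypothetical; `np.var` and `np.cov(..., bias=True)` are used only as a cross-check of the hand-written expressions.

```python
import numpy as np

# Toy data (hypothetical): x2 depends partly on x1, so the two co-vary.
rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = 0.5 * x1 + rng.normal(size=1000)

# Zero-centre both variables by subtracting their means.
x1c = x1 - x1.mean()
x2c = x2 - x2.mean()

variance = np.mean(x1c ** 2)       # E[X^2] for zero-centred X
std = np.sqrt(variance)            # positive square root of the variance
covariance = np.mean(x1c * x2c)    # E[X1 X2] for zero-centred X1 and X2

# Cross-check against NumPy's population (divide-by-n) estimates.
variance_check = np.var(x1)
covariance_check = np.cov(x1, x2, bias=True)[0, 1]
```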

How do you identify elements in a covariance matrix?
- Covariance matrix for 3-dimensional data: the diagonal elements are the variances of each feature and the off-diagonal elements are the covariances between pairs of features
- Covariance of X, Y = covariance of Y, X → the matrix is symmetrical about the main diagonal
- The covariance matrix Σ defines the shape of the data
- If the covariance of two features is 0 then they are uncorrelated
- Defines both the spread (variance) and the orientation (covariance) of the data

How are they all related to PCA?
- If the covariance matrix of the data is a diagonal matrix (covariances are zero) → the variances must be equal to the eigenvalues λ
- If the covariance matrix is not diagonal (covariances are not zero):
  - Eigenvalues still represent the variance magnitude in the direction of the largest spread of data
  - The variance components still represent the variance magnitude in the direction of the x and y axes
  - However, these values are not the same any more

What are eigenvectors and eigenvalues?
- Eigenvectors: the vectors which point in the directions of the largest spread of the data
- Eigenvalues: equal the spread (variance) in the direction defined by the corresponding eigenvector
- Suppose we have drawn a scatter plot of random variables and a line of best fit through the points. This line of best fit shows the direction of maximum variance in the dataset. The eigenvector is the direction of that line, while the eigenvalue is a number that tells us how spread out the data is along that line.

For any matrix X: $X = U S V^T$
- Data X has one row per data point (the data is zero-centred)
- US gives the coordinates of the rows of X in the space of principal components
- S is diagonal, with $S_k > S_{k+1}$; $S_k$ is the kth largest singular value, and its square is proportional to the kth largest eigenvalue of the covariance matrix
- Rows of $V^T$ are unit-length vectors

What are the roles of eigenvectors and eigenvalues in PCA?
- Locate the directions in which the data is spread (eigenvectors)
- Give the importance of each direction (eigenvalues)
- Remove the directions which are relatively unimportant:
  1. List the eigenvalues in descending order
  2. Set a threshold and remove principal components that have small variances (small eigenvalues)
  3. Project the data back

What are the Pros and Cons of PCA?

Advantages
- Easy to compute
- Speeds up other machine learning algorithms
- Counteracts high dimensionality issues

Disadvantages
- Trade-off between information loss and dimensionality reduction
- Cannot capture intrinsic nonlinearity, since it uses a linear projection
- PCA is biased in datasets with strong outliers
- Assumes correlation between features
- Usually not optimal for classification / dimensionality reduction

What is the fundamental idea behind LDA?
- Finds the most discriminant projection by maximising the between-class distance and minimising the within-class distance
- Performs dimensionality reduction whilst preserving as much of the class discriminatory information as possible
- Optimises class separability

Describe the steps of LDA
1. Compute the d-dimensional mean vectors for the different classes within the dataset
2. Compute the scatter matrices (between-class and within-class scatter matrix)
3. Compute the eigenvectors $(e_1, e_2, ..., e_d)$ and their corresponding eigenvalues $(\lambda_1, \lambda_2, ..., \lambda_d)$
4. Sort the eigenvectors by decreasing eigenvalues
5. Choose the k eigenvectors with the largest eigenvalues to form a d × k dimensional matrix

What are the fundamental differences between LDA and PCA?
- LDA is supervised and PCA is unsupervised
- LDA attempts to find a feature subspace which maximises class separability, whereas PCA attempts to find the directions of maximal variance

What are the advantages of LDA compared to PCA?
- LDA is useful for finding dimensions which separate clusters → the clusters are known beforehand
- Labels give LDA discriminative power

When should you use LDA instead of PCA?
- When you have labels available
- When attempting to create a classification model

What are the limitations of LDA?
- Fails when the discriminatory information is not in the mean but in the variance of the data
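The LDA steps listed earlier can be sketched with NumPy as below. This is a minimal illustration under stated assumptions: the function name `lda`, the toy data and the decision to take the eigenvectors of $S_W^{-1} S_B$ (within-class scatter inverse times between-class scatter, the usual formulation) are choices made for the example.

```python
import numpy as np

def lda(X, y, k):
    """Project X (n_samples, d) onto the k most discriminant directions."""
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_within = np.zeros((d, d))
    S_between = np.zeros((d, d))
    for label in np.unique(y):
        X_c = X[y == label]
        mean_c = X_c.mean(axis=0)                       # class mean vector
        S_within += (X_c - mean_c).T @ (X_c - mean_c)   # within-class scatter
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_between += len(X_c) * (diff @ diff.T)         # between-class scatter
    # Eigen-decompose S_within^{-1} S_between and sort by decreasing eigenvalue.
    eigenvalues, eigenvectors = np.linalg.eig(np.linalg.solve(S_within, S_between))
    order = np.argsort(eigenvalues.real)[::-1]
    W = eigenvectors[:, order[:k]].real                 # d x k projection matrix
    return X @ W, W

# Toy data (hypothetical): classes differ along the first axis only,
# while the second axis has a much larger but uninformative variance.
rng = np.random.default_rng(0)
class_0 = rng.normal([0.0, 0.0], [0.5, 5.0], size=(100, 2))
class_1 = rng.normal([3.0, 0.0], [0.5, 5.0], size=(100, 2))
X = np.vstack([class_0, class_1])
y = np.array([0] * 100 + [1] * 100)
Z, W = lda(X, y, k=1)
```

PCA on the same data would favour the second axis, because it has the largest variance, whereas the labels steer LDA towards the first axis, where the classes actually separate.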

Logistic Regression

What is the fundamental concept of Logistic Regression?
- A form of classification
- Instead of the output y being continuous, it takes a value of 0 or 1: $y \in \{0, 1\}$
- The logistic function maps any real number to the interval (0, 1)
- This makes it useful for transforming an arbitrary-valued function into a function better suited for classification
- Builds upon linear regression

How is it different to Linear Regression?
- Restricts the range to between 0 and 1
- A threshold value is needed to determine the class of each instance properly
- Requires a logistic function

Why is linear regression ill-placed to solve discrete classification problems?
- It is sensitive to imbalances within the data (outliers)
- Values are continuous and can be above or below the maximum/minimum
- The decision boundary is not stable and can easily change, giving wildly different results

Why does logistic regression handle discrete classification problems better?
- Predicts a probabilistic value

How is the linear decision boundary encoded into logistic regression?
- The decision boundary is a straight line: $w^T x = 0$
- Positive samples: $w^T x > 0$ (but can be much higher)
- Negative samples: $w^T x < 0$ (but can be much lower)
- Wrap the decision boundary (line) equation within the logistic function, whereby $h_w(x) = g(w^T x) \in [0, 1]$, so the output is now bounded
- Predict y = 1 when $w^T x \geq 0$, i.e. $g(w^T x) \geq 0.5$
- Predict y = 0 when $w^T x < 0$, i.e. $g(w^T x) < 0.5$

How does logistic regression cope with nonlinear decision boundaries?
- Similar to the linear case → try to fit the decision boundary, but with higher-order polynomials: $w^T \phi(x) = 0$
- Involves nonlinear combinations of features

How do you evaluate the decision boundary?
- Project all the positive samples (labelled as y = 1) to the sigmoid (logistic) function g
- Projected values should be as close to 1 as possible
- The mean should be as close to 1 as possible
- Bounded between 0 and 1
- However, the function g is nonlinear and the optimum can be difficult to find
- Transform it to $-\log h_w(x_i)$
- The aim is now to get it as close to 0 as possible
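As a hedged sketch of how these pieces fit together, the code below wraps $w^T x$ in the logistic function, predicts with the 0.5 threshold and minimises the $-\log h_w(x_i)$ cost by gradient descent. The function names, learning rate and toy data are assumptions made for the example, and the $-\log(1 - h_w(x_i))$ term for the negative samples is the standard counterpart of the positive-sample cost rather than something stated in these notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z), mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def add_intercept(X):
    """Prepend a column of ones so that w[0] acts as the intercept w0."""
    return np.column_stack([np.ones(len(X)), X])

def fit_logistic_regression(X, y, learning_rate=0.1, n_iterations=5000):
    """Gradient descent on the mean negative log-likelihood."""
    X1 = add_intercept(X)
    w = np.zeros(X1.shape[1])
    for _ in range(n_iterations):
        h = sigmoid(X1 @ w)                           # h_w(x) = g(w^T x)
        w -= learning_rate * X1.T @ (h - y) / len(y)  # gradient step
    return w

def predict(X, w):
    """Predict y = 1 when w^T x >= 0, i.e. when g(w^T x) >= 0.5."""
    return (add_intercept(X) @ w >= 0).astype(int)

def negative_log_likelihood(X, y, w):
    """Mean of -log h for positive samples and -log(1 - h) for negative ones."""
    h = sigmoid(add_intercept(X) @ w)
    return float(np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h)))

# Toy data (hypothetical): one feature, negatives on the left, positives on the right.
X = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]).reshape(-1, 1)
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
w = fit_logistic_regression(X, y)
predictions = predict(X, w)
final_cost = negative_log_likelihood(X, y, w)
initial_cost = negative_log_likelihood(X, y, np.zeros(2))
```

The cost starts at log 2 (every prediction is 0.5 when w is zero) and falls towards 0 as the positive samples are pushed towards 1 and the negative samples towards 0.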

How does logistic regression deal with multiple classes (number of classes > 2)?
- One-vs-All strategy
- Choose the class which a point is most likely to belong to

SVM

What are the differences between clustering and classification?
- Clustering: unsupervised (no labels)
  - Establishes the existence of classes within the data
- Classification: supervised (labels)
  - Develops an accurate description/model of each class
  - Predicts class labels for unseen data

Why is SVM considered a large margin classifier?
- The cost function penalises data points close to the decision boundary
- Results in a decision boundary furthest away from the data points
- The distance between the decision boundary and the points closest to it is called the margin
- SVM tries to maximise the margin
- Enables better generalisation accuracy
- The observations on the edge of and within the soft margin are called support vectors

How does SVM cope with a nonlinear decision boundary?
- Uses the same linear technique, even though the data is not linearly separable
- Uses kernel functions
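As a small illustration of a kernel function, the sketch below assumes a Gaussian (RBF) kernel with width σ, mirroring the Gaussian basis function used in the regression notes. The function name `gaussian_kernel` and the example points are hypothetical, and other kernels are possible.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """Similarity between two points: 1 when identical, decaying with distance."""
    squared_distance = np.sum((np.asarray(a) - np.asarray(b)) ** 2)
    return float(np.exp(-squared_distance / (2.0 * sigma ** 2)))

# Two example points (hypothetical) compared under different kernel widths.
a, b = [0.0, 0.0], [1.0, 1.0]
same = gaussian_kernel(a, a, sigma=1.0)     # identical points
wide = gaussian_kernel(a, b, sigma=5.0)     # large sigma: similarity stays high
narrow = gaussian_kernel(a, b, sigma=0.5)   # small sigma: similarity falls off fast
```

Replacing the dot product between samples with such a kernel lets a linear large-margin classifier draw a boundary that is nonlinear in the original features.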

What are the SVM Parameters?

C
- Large C: lower bias, higher variance
- Small C: higher bias, lower variance

σ (kernel width)
- Large σ: kernel varies more smoothly; higher bias, lower variance
- Small σ: Kernel is ...