Title | Big Data Machine learning |
---|---|
Author | Jac Rees |
Course | Big Data and Machine Learning |
Institution | Swansea University |
Pages | 38 |
File Size | 1.2 MB |
File Type | |
Total Downloads | 362 |
Total Views | 786 |
Big Data & Machine learning https://www.studocu.com/en-gb/document/swansea-university/big-data-andmachine-learning/past-exams/final-2110-january-2020-questions-andanswers/7816442/view
Intro and Fundamentals

Can you explain the Big Data concept from the machine learning perspective?
- Extreme amounts of data
- Normal methods of processing (analysis, storage, querying, etc.) are difficult
- Can be analysed to reveal information such as trends and associations
- The term is evolving and describes any voluminous amount of structured, semi-structured and unstructured data
What are the key characteristics of Big Data?
- Volume: Size of the data which needs to be analysed → All credit card transactions in Europe
- Velocity: Speed at which the data is generated → Instant messaging
- Variety: Obtained from a variety of sources
- Veracity: Quality of the data which is being analysed
What is Machine Learning and what are the different types of machine learning?
- Collect the data (Supervised or Unsupervised)
- Identify the features of the data
- Choose the ML method/model (KMeans, LDA, PCA, NN, etc.)
- Find the model which best represents the data
- Compare the model with other methods to ensure it was the best choice
- Check the model against test data
What is Supervised Learning?
- Description: Training using data with predetermined labels
- Advantages: Usually gains better results than other methods
- Disadvantages: Can be costly and time consuming
- Types: Classification and Categorisation (discrete data), Regression (continuous data)
- Examples: Email spam detection, digit recognition, mask identification
What is unsupervised learning?
- Description: Training data without labels
- Advantages: Easier, since labels are not always available
- Disadvantages: Less effective in its results; not much point in using it if labels are available
- Types: Clustering (discrete), Dimensionality Reduction (continuous)
- Examples: Segmenting customers to better adjust products and offerings, anomaly detection, social network analysis
Define 2 Other Types Of Machine Learning
- Semi-Supervised Learning
  - Description: Training data contains a few labels
- Reinforcement Learning
  - Description: Rewards from a sequence of actions
  - Example: In chess → There is only one "supervised" signal at the end of the game, yet a move needs to be made at every step
  - Deals with "credit assignment"
Supervised and Unsupervised Methods
Clustering
What is the key concept of clustering?
- Unsupervised: no labels are given
- Groups similar data points together
- Data instances that are (very) different or far away from each other are in different clusters
Why is clustering useful for analysing Big Data?
- Gain insights into the structure of the data
What are the common clustering techniques?
- KMeans
- Fuzzy C-Means
- Gaussian Mixture Model
What is K-Means Clustering?
- KMeans is a partitional clustering algorithm
- The number of clusters is pre-set (the value of K)
- Each cluster is represented by the centre of the cluster
- Each data point is assigned to the nearest cluster centre, or centroid
- Iterative process → Start with a random initialisation
- Aim to minimise the Sum Of Squared Errors (SSE)
What are the key steps in K-means clustering?
1. Choose an initial number of clusters
2. Randomly initialise the centroids
3. Calculate the Sum Of Squared Errors (SSE)
4. Assign each data point to its nearest centroid
5. Move each centroid to the mean position of all data points assigned to it
6. Recalculate the SSE
7. If the change in SSE > threshold and iteration < max_iterations then repeat from step 4
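The K-Means steps listed above can be sketched in NumPy roughly as follows. This is a minimal illustration, not a reference implementation: the function name `kmeans`, its default arguments, the choice to initialise centroids from random data points, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, max_iterations=100, threshold=1e-6, seed=0):
    """Plain K-Means; returns (centroids, labels, sse)."""
    rng = np.random.default_rng(seed)
    # Randomly initialise the centroids (here: k distinct data points, an assumption)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    prev_sse = np.inf
    for _ in range(max_iterations):
        # Assign each data point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
        # Recalculate the SSE and stop once it no longer improves enough
        sse = ((X - centroids[labels]) ** 2).sum()
        if prev_sse - sse <= threshold:
            break
        prev_sse = sse
    return centroids, labels, sse

# Hypothetical data: two well-separated 2-D blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels, sse = kmeans(X, k=2)
```

Each point ends up with exactly one label, which is the hard assignment that distinguishes K-Means from GMM.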
What is Gaussian Mixture Modelling?
- A probabilistic, model-based approach to data clustering
- A weighted sum of multiple single Gaussian functions
- Each cluster is a generative model
- Represents normally distributed subpopulations within an overall population
- The model is described by several parameters
- Calculates the probability that a data point belongs to a cluster (soft assignment)
How can GMM be used for data clustering?
- Identifies the probability that a data point belongs to a certain distribution (cluster)
- Uses mean and variance
What are the differences between K-means and GMM clustering?
- GMM uses a model
- GMM is a soft assignment (the probability that a point belongs / strength of the association); KMeans is a hard assignment (belongs or doesn't)
- KMeans doesn't account for variance within the set
Can you identify the GMM parameters and their roles?
- k: The number of Gaussian functions (and therefore clusters)
- μ: The mean of each Gaussian
- σ: The standard deviation of each Gaussian function
- P: The mixing coefficient of each Gaussian function, AKA the prior probability; all P's in a model add up to 1
Define Prior and Posterior Probability
- Prior Probability: Probability distribution before initial evidence has been taken into account
- Posterior Probability: Probability distribution after initial evidence has been taken into account
Can you describe the steps in GMM clustering?
1. (Optional) Use KMeans to find the initial state
2. Determine initial values:
   - K is given
   - μ is the centroid of each KMeans cluster
   - σ can be calculated from the spread of each cluster around its mean (a smaller σ equates to a higher peak, and vice versa)
   - P is the proportion of data points assigned to each of the Gaussian functions
3. Calculate the posterior probability of each point (the probability that a given Gaussian is the one to which the point belongs)
4. Update the variables using the new posterior probabilities
5. Repeat steps 3 & 4 until the values converge
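As a rough sketch of the posterior and update steps above, here is a one-dimensional expectation-maximisation loop in NumPy. The function name `gmm_em_1d`, the hand-picked initial values (used instead of a KMeans initialisation), the log-likelihood stopping rule and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def gmm_em_1d(x, mu, sigma, P, n_iter=200, tol=1e-8):
    """EM for a 1-D Gaussian mixture; mu, sigma, P are initial guesses of length k."""
    mu, sigma, P = (np.asarray(a, dtype=float).copy() for a in (mu, sigma, P))
    prev_ll = -np.inf
    for _ in range(n_iter):
        # Posterior p(j|x) = p(x|j) P(j) / p(x) for every point and every Gaussian
        lik = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        joint = lik * P                          # p(x|j) P(j), shape (n, k)
        px = joint.sum(axis=1, keepdims=True)    # p(x)
        post = joint / px
        # Update mu, sigma and P using the posteriors as soft weights
        Nj = post.sum(axis=0)
        mu = (post * x[:, None]).sum(axis=0) / Nj
        sigma = np.sqrt((post * (x[:, None] - mu) ** 2).sum(axis=0) / Nj)
        P = Nj / len(x)
        # Stop once the log-likelihood has converged
        ll = np.log(px).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu, sigma, P, post

# Hypothetical data: 30% of points around -2, 70% around 3
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 700)])
mu, sigma, P, post = gmm_em_1d(x, mu=[-1.0, 1.0], sigma=[1.0, 1.0], P=[0.5, 0.5])
```

Each row of `post` sums to 1 and gives the soft assignment of one point across the Gaussians; taking the largest entry per row would turn it into a hard cluster label.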
What impact do outliers have on clustering results?
- Can skew the results (inaccurate clustering)
- KMeans is particularly sensitive to outliers
What steps can be taken to minimise the outlier impact?
- Delete the suspected outlier and run the algorithm again
- If there is a big difference in the result then you have found an outlier
Usages of GMM
- Background Removal In Videos
  - Every pixel in the frame is evaluated
  - If the probability of a pixel is below the threshold, that pixel is kept for display
  - Otherwise the pixel is set as background (e.g. set to green)
- Image Segmentation
  - Build a colour histogram using all the pixels
  - Model the histogram using GMM
  - Classify each pixel based on the posterior probability:

$$p(j \mid x) = \frac{p(x \mid j)\,P(j)}{p(x)}$$
Linear Regression

What is linear regression?
- A form of Supervised Learning; more concretely, Regression
- Predict the value of $y_{n+1}$ given a new point $x_{n+1}$, after obtaining the parameters $w_i$ from training on the examples $(x_i, y_i)$, $i = 1, \ldots, n$
- Used to show or predict the relationship between two variables
- Each observation consists of two values (independent and dependent)
- A straight line approximates the relationship between the dependent and independent variables
What is the fundamental concept of the independent, dependent and linear regression?
- Independent Variable (x): The factors used to predict the dependent variable
- Dependent Variable (y): The factor being predicted
- Estimate $\hat{y}$ using a linear function of the data x:

$$\hat{y}_{n+1} = \mathbf{w}^T \mathbf{x}_{n+1}, \quad \mathbf{w} = (w_0, w_1, w_2, \ldots)^T, \quad \mathbf{x}_{n+1} = (1, x_{n+1,1}, x_{n+1,2}, \ldots)^T$$

- Add one as the first component of x to write the linear equation as a dot product between two vectors
What criteria does linear regression use to evaluate regression results?
- The accuracy of the hypothesis is measured using a cost function → Takes the average of all the results
- Apply a cost function (Least Mean Squares) which aims to minimise the Sum Of Squared Errors by taking the difference between the predicted value and the actual value and squaring the result
- To find the solution of LMS, employ Gradient Descent
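A minimal sketch of minimising the mean squared error with gradient descent is shown below. The function name `fit_linear_gd`, the learning rate, the step count and the synthetic data are assumptions chosen for the example; the comparison against NumPy's least-squares solver is only a sanity check.

```python
import numpy as np

def fit_linear_gd(X, y, lr=0.1, n_steps=2000):
    """Fit w by gradient descent on the mean squared error."""
    # Add one as the first component so that the prediction is a dot product w^T x
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        residual = Xb @ w - y                   # predicted minus actual
        grad = 2 * Xb.T @ residual / len(y)     # gradient of mean(residual ** 2)
        w -= lr * grad
    return w

# Hypothetical data generated from y = 2 + 3x plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 0.05, 100)
w = fit_linear_gd(x, y)

# Closed-form least-squares solution for comparison
Xb = np.column_stack([np.ones(len(x)), x])
w_exact, *_ = np.linalg.lstsq(Xb, y, rcond=None)
```

With these settings `w` should land close to the intercept and slope used to generate the data, and close to `w_exact`.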
What is the limitation of the criteria?
- Sensitive to outliers
- Tendency to overfit the data
- Struggles with input spaces involving higher dimensions
Describe Nonlinear Fitting
- Powerful function approximators
- Predictions are still linear in the coefficients w:

$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$

- Same maths as multidimensional linear regression
- Instead of extra dimensions, extra powers of x
Describe various Linear Basis Function Models
- Polynomial Basis: $\phi_j(x) = x^j$
  - Global → Small changes to x affect all the basis functions
- Gaussian Basis: $\phi_j(x) = \exp\left\{-\frac{(x - \mu_j)^2}{2s^2}\right\}$
  - Local → A small change in x only affects nearby basis functions
  - $\mu_j$, s: location, scale (width)
Describe Quadratic Regularisation
- A method to reduce overfitting
- Is added to the end of the cost function
- λ: The regularisation coefficient
- High λ: Reduces variation of the values
- Low λ: Less effect on the variation
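To make the polynomial basis and the quadratic penalty concrete, here is a small sketch that fits a degree-9 polynomial with two different values of λ. The helper names `poly_design` and `fit_ridge`, the closed-form solve, the noisy sine data and the two λ values are assumptions for illustration.

```python
import numpy as np

def poly_design(x, M):
    """Design matrix with columns x^0, x^1, ..., x^M (polynomial basis)."""
    return np.vander(x, M + 1, increasing=True)

def fit_ridge(Phi, y, lam):
    """Minimise ||Phi w - y||^2 + lam * ||w||^2 in closed form."""
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

# Hypothetical data: 10 noisy samples of a sine wave
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 10)

Phi = poly_design(x, M=9)
w_small = fit_ridge(Phi, y, lam=1e-6)   # weak regularisation: large, wiggly coefficients
w_large = fit_ridge(Phi, y, lam=1.0)    # strong regularisation: coefficients shrink
```

Comparing `np.linalg.norm(w_small)` with `np.linalg.norm(w_large)` shows the effect of λ: the larger penalty pulls the coefficient values towards zero, which trades variance for bias.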
Define Bias, Variance and Expected Loss
- Bias: The average difference between the predicted and real values
- Variance: The variability of the values predicted
- Expected Loss: A formula which uses bias and variance; a low expected loss is the aim

$$\text{Expected Loss} = \text{bias}^2 + \text{variance} + \text{noise}$$

- High bias = too much regularisation
- High variance = too little regularisation
PCA & LDA

Why is dimensionality reduction useful?
- Datasets may have millions of dimensions, which can affect:
  - Redundant features (e.g. not all words are useful)
  - Efficiency of algorithms
  - Cost of storage
  - Data retrieval
- Dimensionality reduction (D.R.) reduces the number of dimensions whilst maintaining the meaningfulness of the data
- Finds a low-dimensional but useful representation
- Discovers the intrinsic dimensionality of the data
What is the fundamental concept of PCA?
- A linear method used to reduce the dimensionality of data
- Reduces the dimensionality of large numbers of interrelated variables whilst retaining vital information
- Transforms them to a new set of variables (Principal Components) which are uncorrelated and ordered so that the first few retain most of the variation present in the original variables
- Transforms the coordinate system so that the axes are aligned with the data and the variance decreases from one axis to the next
Describe the steps for PCA
1. Standardise the data
   - Subtract the mean of the feature column from the data values (with large outliers the median is sometimes used instead), where n is the number of observations:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

   - This results in zero-centred values
2. Compute the covariance matrix of the features from the dataset
3. Perform eigendecomposition on the covariance matrix
4. Order the eigenvectors in decreasing order based on the magnitude of their corresponding eigenvalues
5. Determine k → The number of top principal components to select
   - To determine the number of principal components to select for dimensionality reduction, plot the cumulative sum of the normalised eigenvalues:

$$\frac{\lambda_j}{\sum_{i=1}^{d} \lambda_i}$$

6. Perform Singular Value Decomposition
   - U is the left singular vectors, V* is the conjugate transpose of the right singular vectors
   - S are the singular values
   - Singular values are related to the eigenvalues calculated from the eigendecomposition
7. Compute the new k-dimensional feature space
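The eigendecomposition route through the PCA steps above might look like the following sketch. The function name `pca_eig`, the decision to centre without rescaling the features, and the synthetic 3-D data are assumptions for illustration.

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                        # zero-centre each feature column
    cov = np.cov(Xc, rowvar=False)                 # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigh, since cov is symmetric
    order = np.argsort(eigvals)[::-1]              # decreasing eigenvalue order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum() # cumulative share of the variance
    Z = Xc @ eigvecs[:, :k]                        # new k-dimensional feature space
    return Z, eigvals, eigvecs, explained

# Hypothetical correlated 3-D data
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.1]])
X = rng.normal(size=(200, 3)) @ mix
Z, eigvals, eigvecs, explained = pca_eig(X, k=2)
```

Plotting `explained` against the component index gives the cumulative curve used to choose k, and the covariance matrix of `Z` comes out diagonal, which is what "uncorrelated principal components" means in practice.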
What are variance and covariance?
- Variance: A measure of the variability that utilises all of the data (how far the data points are spread out)
  - It is the average of the squared differences between the data values and their mean:

$$\operatorname{var}(X) = \sigma^2 = E[(X - \bar{X})^2]$$

  - where $E(\cdot)$ denotes the expected value (i.e. mean) and $\bar{X}$ is the mean
  - For zero-centred values, variance simplifies to:

$$\operatorname{var}(X) = \sigma^2 = E[X^2]$$

- Standard Deviation: Positive square root of the variance
  - Measured in the same unit as the data, and as such is more easily interpreted than the variance

$$\sigma(X) = \sqrt{\operatorname{var}(X)}$$

- Covariance: Measure of the joint variability of two random variables
  - How much two features vary together around their average values
  - Covariance between two (random) variables $X_1$ and $X_2$ is defined as:

$$\operatorname{Cov}(X_1, X_2) = E[(X_1 - E(X_1))(X_2 - E(X_2))]$$

  - For zero-centred values, covariance simplifies to:

$$\operatorname{Cov}(X_1, X_2) = E(X_1 X_2)$$
How do you identify elements in a covariance matrix?
- Covariance matrix for 3-dimensional data: the variances lie on the main diagonal and the covariances lie off the diagonal
- Covariance of X, Y = Covariance of Y, X → Matrix is symmetrical about the main diagonal
- The covariance matrix Σ defines the shape of the data
- If the covariance of two variables is 0 then they are uncorrelated
- Defines both the spread (variance) and the orientation (covariance) of the data
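A small sketch of a covariance matrix for 3-dimensional data, built by hand and checked against `np.cov`. The variables `x`, `y`, `z` are synthetic, and dividing by n - 1 (the sample covariance, which is also NumPy's default) rather than n is an assumption of this example.

```python
import numpy as np

# Hypothetical 3-D data: y depends on x, z is independent of both
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.5, size=500)
z = rng.normal(size=500)
data = np.column_stack([x, y, z])

# Zero-centre, then average the products of the centred columns
centred = data - data.mean(axis=0)
Sigma = centred.T @ centred / (len(data) - 1)

variances = np.diag(Sigma)   # main diagonal: var(x), var(y), var(z)
cov_xy = Sigma[0, 1]         # equals Sigma[1, 0], since the matrix is symmetric
cov_xz = Sigma[0, 2]         # close to zero: x and z are uncorrelated
```

Entry (i, j) is the covariance between feature i and feature j, so reading the matrix amounts to reading variances off the diagonal and pairwise covariances everywhere else.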
How are they all related to PCA?
- If the covariance matrix of the data is a diagonal matrix (covariances are zero) → The variances are equal to the eigenvalues λ
- If the covariance matrix is not diagonal (covariances are not zero):
  - Eigenvalues still represent the variance magnitude in the direction of the largest spread of data
  - The variance components still represent the variance magnitude in the direction of the x and y axes
  - However, these values are not the same anymore
What are eigenvectors and eigenvalues?
- Eigenvectors: The vectors which point in the directions of the largest spread of the data
- Eigenvalues: Equal to the spread (variance) in the direction defined by the corresponding eigenvector
- Suppose we have drawn a scatter plot of two random variables, and a line of best fit is drawn through the points. This line of best fit shows the direction of maximum variance in the dataset. The eigenvector is the direction of that line, while the eigenvalue is a number that tells us how spread out the data is along that line.
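For any matrix X:

$$X = U S V^T$$

- Data X, one row per data point (data is zero-centred)
- US gives the coordinates of the rows of X in the space of principal components
- S is diagonal, with $S_k > S_{k+1}$; $S_k$ is the kth largest singular value, and its square is proportional to the kth largest eigenvalue of the covariance matrix
- Rows of $V^T$ are unit-length vectors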
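The link between the SVD of zero-centred data and the eigenvalues of its covariance matrix can be checked numerically. The sketch below uses synthetic data and assumes the sample covariance (dividing by n - 1), under which each eigenvalue equals the squared singular value divided by n - 1.

```python
import numpy as np

# Hypothetical data, one row per data point, zero-centred before the SVD
rng = np.random.default_rng(0)
mix = np.array([[3.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.2]])
X = rng.normal(size=(100, 3)) @ mix
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)   # Xc = U @ diag(S) @ Vt

# Eigenvalues of the covariance matrix, sorted in decreasing order
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
eig_from_svd = S ** 2 / (len(Xc) - 1)

# U S gives the coordinates of each row in the space of principal components
scores = U * S
```

`np.linalg.svd` returns the singular values already sorted in decreasing order, and `scores` matches the projection `Xc @ Vt.T` onto the rows of `Vt`.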
What are the roles of eigenvectors and eigenvalues in PCA?
- Locate the directions in which the data is spread (eigenvectors)
- The importance of each direction (eigenvalues)
- Remove the directions which are relatively unimportant:
  - List the eigenvalues in descending order
  - Set a threshold and remove principal components that have small variances (small eigenvalues)
- Project the data back
What are the Pros and Cons of PCA?
- Advantages
  - Easy to compute
  - Speeds up other machine learning algorithms
  - Counteracts high dimensionality issues
- Disadvantages
  - Trade-off between information loss and dimensionality reduction
  - Cannot capture intrinsic nonlinearity, since it uses a linear projection
  - PCA is biased in datasets with strong outliers
  - Assumes correlation between features
  - Usually not optimal for classification/dimensionality reduction
What is the fundamental idea behind LDA?
- Finds the most discriminant projection by maximising the between-class distance and minimising the within-class distance
- Performs dimensionality reduction whilst preserving as much of the class discriminatory information as possible
- Optimises class separability
Describe the steps of LDA
1. Compute the d-dimensional mean vectors for the different classes within the dataset
2. Compute the scatter matrices (between-class and within-class scatter matrix)
3. Compute the eigenvectors $(e_1, e_2, \ldots, e_d)$ and their corresponding eigenvalues $(\lambda_1, \lambda_2, \ldots, \lambda_d)$
4. Sort the eigenvectors by decreasing eigenvalues
5. Choose the k eigenvectors with the largest eigenvalues to form a d × k-dimensional matrix
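A possible NumPy sketch of the LDA steps above follows. The notes do not say which matrix is decomposed; this example assumes the common choice of the eigenvectors of S_W⁻¹ S_B (within-class scatter inverted, times between-class scatter). The function name `lda` and the synthetic two-class data are also assumptions.

```python
import numpy as np

def lda(X, y, k):
    """Project X onto the k most discriminant directions; returns (Z, W)."""
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_W = np.zeros((d, d))   # within-class scatter
    S_B = np.zeros((d, d))   # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)                       # d-dimensional class mean
        S_W += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - overall_mean)[:, None]
        S_B += len(Xc) * (diff @ diff.T)
    # Assumed formulation: eigenvectors of inv(S_W) @ S_B
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]             # decreasing eigenvalues
    W = eigvecs[:, order[:k]].real                     # d x k projection matrix
    return X @ W, W

# Hypothetical data: classes differ along the first axis, but most of the
# variance lies along the second axis
rng = np.random.default_rng(0)
X0 = rng.normal([0.0, 0.0], [0.5, 5.0], size=(100, 2))
X1 = rng.normal([3.0, 0.0], [0.5, 5.0], size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)
Z, W = lda(X, y, k=1)
```

On this data PCA would pick the high-variance second axis, whereas the LDA direction `W` points mostly along the first axis, where the classes actually separate.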
What are the fundamental differences between LDA and PCA?
- LDA is supervised and PCA is unsupervised
- LDA attempts to find a feature subspace which maximises class separability, whereas PCA attempts to find the directions of maximal variance
What are the advantages of LDA compared to PCA?
- LDA is useful for finding dimensions which separate clusters → The clusters are known beforehand
- Labels give LDA discriminative power
When should you use LDA instead of PCA?
- When you have labels available
- When attempting to create a classification model
What are the limitations of LDA?
- Fails when the discriminatory information is not in the mean but in the variance of the data
Logistic Regression

What is the fundamental concept of Logistic Regression?
- Form of classification
- Instead of the output y being continuous → It takes a value of 0 or 1: y ∈ {0, 1}
- The logistic function maps any real number to the interval (0, 1)
- Makes it useful for transforming an arbitrary-valued function into a function better suited for classification
- Builds upon linear regression
How is it different to Linear Regression?
- Restricts the output range to [0, 1]
- A threshold value is needed to determine the class of each instance properly
- Requires a logistic function
Why is linear regression ill-placed to solve discrete classification problems?
- It is sensitive to imbalances within data (outliers)
- Values are continuous and can be above or below the maximum/minimum
- The decision boundary is not stable and can easily change, giving wildly different results
Why does logistic regression handle discrete classification problems better?
- Predicts a probabilistic value
How is the linear decision boundary encoded into logistic regression?
- The decision boundary is a straight line: $w^T x = 0$
  - Positive samples > 0 (but can be much higher)
  - Negative samples < 0 (but can be much lower)
- Wrap the decision boundary (line) equation within the logistic function g, whereby:

$$h_w(x) = g(w^T x) \in [0, 1]$$

- The output is now bounded
- Predict y = 1: $w^T x \geq 0$, $g(w^T x) \geq 0.5$
- Predict y = 0: $w^T x < 0$, $g(w^T x) < 0.5$
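The wrapping of the linear boundary in the logistic function can be sketched as below. The weights `w` are hypothetical values chosen so that the boundary sits at x = 0.5; the names `g` and `predict` and the three sample points are illustrative only.

```python
import numpy as np

def g(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, X, threshold=0.5):
    """Return h_w(x) = g(w^T x) and the class obtained by thresholding it."""
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend 1 for the bias weight w0
    h = g(Xb @ w)
    return h, (h >= threshold).astype(int)

# Hypothetical weights: w^T x = -1 + 2x, so the decision boundary is x = 0.5
w = np.array([-1.0, 2.0])
X = np.array([[0.0], [0.5], [2.0]])
h, y_hat = predict(w, X)
```

The three points have w^T x equal to -1, 0 and 3, so they fall below, exactly on and above the boundary; `h` stays inside (0, 1) however large w^T x becomes.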
How does logistic regression cope with nonlinear decision boundaries?
- Similar to the linear case → Try to fit the decision boundary, but with higher-order polynomials: $w^T \phi(x) = 0$
- Involves nonlinear combinations of features
How do you evaluate the decision boundary?
- Project all the positive samples (labelled as y = 1) to the sigmoid (logistic) function g
- Projected values should be as close to 1 as possible
- The mean should be as close to 1 as possible
- Bounded between 0 and 1
- However, the function g is nonlinear and the optimum can be difficult to find
- Transform it to $-\log h_w(x_i)$
- The aim is now to get it as close to 0 as possible
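The notes only state the −log h term for positive samples. The sketch below assumes the standard counterpart −log(1 − h) for negative samples, averages both into one cost, and minimises it with gradient descent. The function names, learning rate, step count and synthetic 1-D data are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, Xb, y, eps=1e-12):
    """Mean of -log h for positives and -log(1 - h) for negatives (assumed form)."""
    h = sigmoid(Xb @ w)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

def fit_logistic(Xb, y, lr=0.5, n_steps=2000):
    """Gradient descent on the cost above."""
    w = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        h = sigmoid(Xb @ w)
        w -= lr * Xb.T @ (h - y) / len(y)   # gradient of the mean cost
    return w

# Hypothetical 1-D data: class 0 centred at -1, class 1 centred at 2
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 100), rng.normal(2, 1, 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])
Xb = np.column_stack([np.ones(len(x)), x])

w = fit_logistic(Xb, y)
loss_start = cost(np.zeros(2), Xb, y)   # all-zero weights predict 0.5 everywhere
loss_end = cost(w, Xb, y)
accuracy = np.mean((sigmoid(Xb @ w) >= 0.5) == y)
```

Driving the cost towards 0 pushes h towards 1 on positive samples and towards 0 on negative ones, which is the same goal as the projection argument above expressed in a form that is easier to optimise.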
How does logistic regression deal with multiple classes (number of classes > 2)?
- One-Vs-All strategy
- Choose the class which a point is most likely to belong to
SVM
What are the differences between clustering and classification?
- Clustering: Unsupervised (no labels)
  - Establish the existence of classes within data
- Classification: Supervised (labels)
  - Develop an accurate description/model of each class
  - Predict class labels for unseen data
Why is SVM considered a large margin classifier?
- The cost function penalises data points close to the decision boundary
- Results in a decision boundary furthest away from the data points
- The distance between the decision boundary and the points closest to it is called the margin
- SVM tries to maximise the margin
- Enables better generalisation accuracy
- The observations on the edge of and within the soft margin are called Support Vectors
How does SVM cope with a nonlinear decision boundary?
- Uses the linear regression technique, but the data is nonlinear
- Uses kernel functions
What are the SVM Parameters?
- C
  - Large C = Lower bias, higher variance
  - Small C = Higher bias, lower variance
- σ (Kernel Width)
  - Large σ: Kernel varies more smoothly; higher bias, lower variance
  - Small σ: Kernel is ...