LP-III Lab Manual: Machine Learning
Savitribai Phule Pune University

Assignment No. 1

Aim: To implement Linear Regression to find the equation of the best fit line for given data.

Objective:
• The basic concepts of Linear Regression.
• Implementation logic of Linear Regression to find the equation of the best fit line for given data.

Theory:

Introduction: Regression analysis is a widely used statistical tool to establish a relationship model between two variables. One of these variables is called the predictor variable, whose value is gathered through experiments. The other variable is called the response variable, whose value is derived from the predictor variable. Linear models are the simplest parametric methods and always deserve attention, because many problems, even intrinsically non-linear ones, can often be solved with these models. A regression is a prediction where the target is continuous, and its applications are numerous, so it is important to understand how a linear model can fit the data, what its strengths and weaknesses are, and when it is preferable to pick an alternative.

Types of Regression:
• Linear
• Multiple Linear
• Logistic
• Polynomial

In Linear Regression these two variables are related through an equation in which the exponent (power) of both variables is 1. Mathematically, a linear relationship represents a straight line when plotted as a graph. A non-linear relationship, where the exponent of any variable is not equal to 1, creates a curve.


The general mathematical equation for a linear regression is

y = ax + b

where y is the response variable, x is the predictor variable, and a and b are constants called the coefficients.
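The coefficients a and b of the best-fit line are usually estimated with the ordinary least-squares formulas. The following is a minimal sketch of that calculation using numpy; the height/weight sample values are invented purely for illustration.

import numpy as np

# Illustrative sample data (heights in cm, weights in kg); values are made up for this sketch
x = np.array([151, 174, 138, 186, 128, 136, 179, 163, 152, 131], dtype=float)
y = np.array([63, 81, 56, 91, 47, 57, 76, 72, 62, 48], dtype=float)

# Ordinary least-squares estimates for y = ax + b
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

print(f"best fit line: y = {a:.3f}x + {b:.3f}")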

Applications
• Trend lines: A trend line represents the variation in some quantitative data with the passage of time (such as GDP or oil prices). These trends usually follow a linear relationship, so linear regression can be applied to predict future values.
• Economics: To predict consumption spending, fixed investment spending, inventory investment, purchases of a country's exports, spending on imports, the demand to hold liquid assets, labor demand, and labor supply.
• Finance: The capital asset pricing model uses linear regression to analyze and quantify the systematic risk of an investment.
• Biology: Linear regression is used to model causal relationships between parameters in biological systems.

Python Packages needed
• pandas – Data Analytics
• numpy – Numerical Computing
• matplotlib.pyplot – Plotting graphs
• sklearn – Regression Classes

Steps to establish Linear Regression

A simple example of regression is predicting the weight of a person when his height is known. To do this we need the relationship between the height and weight of a person. The steps to create the relationship are:
• Carry out the experiment of gathering a sample of observed values of height and corresponding weight.
• Create the object of the Linear Regression class.
• Train the algorithm with the dataset of X and y.
• Get a summary of the relationship model to know the average error in prediction, also called the residuals.
• To predict the weight of new persons, use the predict() function.
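Assuming scikit-learn's LinearRegression class (from the sklearn package listed above), these steps could be sketched as follows; the height/weight values and the 170 cm query are invented for illustration, not taken from the manual.

import numpy as np
from sklearn.linear_model import LinearRegression

# Step 1: a sample of observed heights (cm) and corresponding weights (kg); values invented for illustration
X = np.array([[151], [174], [138], [186], [128], [136], [179], [163], [152], [131]])
y = np.array([63, 81, 56, 91, 47, 57, 76, 72, 62, 48])

# Step 2: create the object of the Linear Regression class
model = LinearRegression()

# Step 3: train the algorithm with the dataset of X and y
model.fit(X, y)

# Step 4: summarize the relationship and the average error in prediction (residuals)
residuals = y - model.predict(X)
print("coefficient (a):", model.coef_[0])
print("intercept (b):", model.intercept_)
print("mean absolute residual:", np.mean(np.abs(residuals)))

# Step 5: predict the weight of a new person with the predict() function
print("predicted weight for height 170 cm:", model.predict([[170]])[0])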

Conclusion: We have studied Linear Regression and implemented it successfully.

Assignment No. 2

Aim: To implement a Decision Tree Classifier.

Objective:
• The basic concepts of the Decision Tree Classifier.
• Implementation logic of the Decision Tree Classifier.

Theory:

Introduction: A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after computing all attributes). The paths from root to leaf represent classification rules. Decision trees are diagrams that attempt to display the range of possible outcomes and subsequent decisions made after an initial decision. For example, your original decision might be whether to attend college, and the tree might attempt to show how much time would be spent doing different activities and your earning power based on your decision. There are several notable pros and cons to using decision trees. In decision analysis, a decision tree and the closely related influence diagram are used as a visual and analytical decision support tool, where the expected values (or expected utility) of competing alternatives are calculated. A decision tree consists of three types of nodes:
1. Decision nodes – commonly represented by squares
2. Chance nodes – represented by circles
3. End nodes – represented by triangles

Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal. If in practice decisions have to be taken online with no recall under incomplete knowledge, a decision tree should be paralleled by a probability model as a best choice model or online selection model algorithm. Another use of decision trees is as a descriptive means for calculating conditional probabilities. Decision trees, influence diagrams, utility functions, and other decision analysis tools and methods are taught to undergraduate students in schools of business, health economics, and public health, and are examples of operations research or management science methods.

ALGORITHM USED: In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. The ID3 algorithm begins with the original set as the root node. On each iteration, it iterates through every unused attribute of the set and calculates the entropy (or information gain) of that attribute. It then selects the attribute which has the smallest entropy (or largest information gain) value. The set is then split by the selected attribute (e.g. age < 50, 50 <= age < 100, age >= 100) to produce subsets of the data. The algorithm continues to recurse on each subset, considering only attributes never selected before. Recursion on a subset may stop in one of these cases:
• Every element in the subset belongs to the same class (+ or -); then the node is turned into a leaf and labelled with the class of the examples.
• There are no more attributes to be selected, but the examples still do not belong to the same class (some are + and some are -); then the node is turned into a leaf and labelled with the most common class of the examples in the subset.
• There are no examples in the subset. This happens when no example in the parent set was found to match a specific value of the selected attribute, for example if there was no example with age >= 100. Then a leaf is created and labelled with the most common class of the examples in the parent set.

Throughout the algorithm, the decision tree is constructed with each non-terminal node representing the selected attribute on which the data was split, and terminal nodes representing the class label of the final subset of this branch.

Summary:
1. Calculate the entropy of every attribute using the data set (a short sketch of this calculation is given below).
2. Split the set into subsets using the attribute for which entropy is minimum (or, equivalently, information gain is maximum).
3. Make a decision tree node containing that attribute.
4. Recurse on subsets using the remaining attributes.
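To make steps 1 and 2 concrete, here is a small sketch of how the entropy and information gain used by ID3 can be computed in Python. The helper names and the toy weather-style dataset are illustrative assumptions, not part of the manual.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attribute_index):
    # Entropy of the whole set minus the weighted entropy of the subsets
    # produced by splitting on the given attribute (ID3's selection criterion)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute_index], []).append(label)
    weighted = sum(len(sub) / len(labels) * entropy(sub) for sub in subsets.values())
    return entropy(labels) - weighted

# Toy dataset: attributes = (outlook, windy), class label = play yes/no (values invented)
rows = [("sunny", "false"), ("sunny", "true"), ("rain", "false"),
        ("rain", "true"), ("overcast", "false")]
labels = ["no", "no", "yes", "no", "yes"]

for i, name in enumerate(["outlook", "windy"]):
    print(name, "information gain:", round(information_gain(rows, labels, i), 3))

ID3 would split on the attribute with the largest information gain ("outlook" in this toy example) and then recurse on each resulting subset.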

Decision trees offer advantages over other methods of analyzing alternatives. They are:
□ Graphic: You can represent decision alternatives, possible outcomes, and chance events schematically. The visual approach is particularly helpful in comprehending sequential decisions and outcome dependencies.
□ Efficient: You can quickly express complex alternatives clearly, and you can easily modify a tree as new information becomes available. Set up a decision tree to compare how changing input values affect various decision alternatives. Standard decision tree notation is easy to adopt.
□ Revealing: You can compare competing alternatives, even without complete information, in terms of risk and probable value. The Expected Value (EV) term combines relative investment costs, anticipated payoffs, and uncertainties into a single numerical value. The EV reveals the overall merits of competing alternatives.
□ Complementary: You can use decision trees in conjunction with other project management tools. For example, the decision tree method can help evaluate project schedules.
□ Decision trees are self-explanatory, and when compacted they are also easy to follow. In other words, if the decision tree has a reasonable number of leaves, it can be grasped by non-professional users. Furthermore, decision trees can be converted to a set of rules, so this representation is considered comprehensible.
□ Decision trees can handle both nominal and numerical attributes.
□ The decision tree representation is rich enough to represent any discrete-value classifier.
□ Decision trees are capable of handling datasets that may have errors.
□ Decision trees are capable of handling datasets that may have missing values.
□ Decision trees are considered to be a nonparametric method. This means that decision trees have no assumptions about the space distribution and the classifier structure.

On the other hand, decision trees have disadvantages such as:
□ Most of the algorithms (like ID3 and C4.5) require that the target attribute has only discrete values.
□ As decision trees use the "divide and conquer" method, they tend to perform well if a few highly relevant attributes exist, but less so if many complex interactions are present. One of the reasons for this is that other classifiers can compactly describe a classifier that would be very challenging to represent using a decision tree.

□ The greedy characteristic of decision trees leads to another disadvantage that should be pointed out: over-sensitivity to the training set, to irrelevant attributes, and to noise.

Conclusion: We have studied the Decision Tree classifier and implemented it successfully.

Assignment No. 3

Aim: To implement the k-NN algorithm for classifying the points on a given graph.

Objective:
• The basic concepts of the k-NN algorithm.
• Implementation logic of the k-NN algorithm for classifying the points on a given graph.

Theory:

Introduction: KNN is a non-parametric, lazy learning algorithm. Its purpose is to use a database in which the data points are separated into several classes to predict the classification of a new sample point. When we say a technique is non-parametric, it means that it does not make any assumptions about the underlying data distribution. In other words, the model structure is determined from the data. Therefore, KNN could, and probably should, be one of the first choices for a classification study when there is little or no prior knowledge about the distribution of the data. The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other. KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) with some mathematics we might have learned in our childhood: calculating the distance between points on a graph. There are other ways of calculating distance, and one way might be preferable depending on the problem we are solving. However, the straight-line distance (also called the Euclidean distance) is a popular and familiar choice.

KNN Algorithm

We can implement a KNN model by following the steps below:


□ Load the data.
□ Initialize the value of k.
□ For getting the predicted class, iterate from 1 to the total number of training data points:
  • Calculate the distance between the test data and each row of training data. Here we use Euclidean distance as our distance metric since it is the most popular method. Other metrics that can be used are Chebyshev, cosine, etc.
  • Sort the calculated distances in ascending order based on distance values.
  • Get the top k rows from the sorted array.
  • Get the most frequent class of these rows.
  • Return the predicted class.
A from-scratch sketch of these steps is given below, after the notes on choosing K.

Choosing the right value for K

To select the K that is right for your data, we run the KNN algorithm several times with different values of K and choose the K that reduces the number of errors we encounter, while maintaining the algorithm's ability to accurately make predictions when it is given data it hasn't seen before. Here are some things to keep in mind:

□ As we decrease the value of K to 1, our predictions become less stable. Imagine K=1 and a query point surrounded by several red points and one green point, where the green point happens to be the single nearest neighbor. Reasonably, we would think the query point is most likely red, but because K=1, KNN incorrectly predicts that the query point is green.
□ Inversely, as we increase the value of K, our predictions become more stable due to majority voting / averaging, and thus more likely to be accurate (up to a certain point). Eventually, we begin to witness an increasing number of errors; it is at this point we know we have pushed the value of K too far.
□ In cases where we are taking a majority vote (e.g. picking the mode in a classification problem) among labels, we usually make K an odd number to have a tiebreaker.
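Following the steps listed earlier, a k-NN classifier can be sketched from scratch in a few lines of numpy. The point coordinates, class labels, and function name below are illustrative assumptions, not part of the manual.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # 1. Calculate the Euclidean distance between the query and each training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # 2. Sort by distance and take the top k rows
    nearest = np.argsort(distances)[:k]
    # 3. Return the most frequent class among those k neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2-D points and class labels, invented for illustration
X_train = np.array([[1.0, 1.2], [1.5, 1.8], [5.0, 5.2], [6.0, 5.8], [1.2, 0.8], [5.5, 6.1]])
y_train = np.array(["red", "red", "green", "green", "red", "green"])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))   # expected: red
print(knn_predict(X_train, y_train, np.array([5.6, 5.9]), k=3))   # expected: green

Trying several odd values of k (1, 3, 5, ...) on held-out data and keeping the one with the fewest errors follows the advice on choosing K given above.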

Advantages
□ No assumptions about data – useful, for example, for nonlinear data.
□ Simple and easy to implement algorithm – easy to explain and understand/interpret.
□ High accuracy (relatively) – it is pretty high, though not competitive with better supervised learning models.
□ Versatile – useful for classification or regression.

Disadvantages
□ Computationally expensive – because the algorithm stores all of the training data.
□ High memory requirement.
□ Stores all (or almost all) of the training data.
□ The prediction stage might be slow (with big N).
□ Sensitive to irrelevant features and the scale of the data.
□ The algorithm gets significantly slower as the number of examples and/or predictors/independent variables increases.

Conclusion: We have studied the k-NN classification algorithm and implemented it successfully.

Assignment No. 4

Aim: To implement the K-means algorithm for clustering.

Objective:
• The basic concepts of the K-means algorithm.
• Implementation logic of the K-means algorithm.

Theory:

Introduction: K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. Typically, unsupervised algorithms make inferences from datasets using only input vectors, without referring to known or labeled outcomes. A cluster refers to a collection of data points aggregated together because of certain similarities. You define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the center of a cluster. Every data point is allocated to one of the clusters by reducing the in-cluster sum of squares. In other words, the K-means algorithm identifies k centroids and then allocates every data point to the nearest cluster, while keeping the in-cluster sum of squares as small as possible. The 'means' in K-means refers to averaging of the data, that is, finding the centroid. To process the learning data, the K-means algorithm starts with a first group of randomly selected centroids, which are used as the beginning points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids. It halts creating and optimizing clusters when either:
• The centroids have stabilized – there is no change in their values because the clustering has been successful, or
• The defined number of iterations has been reached.
The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori.

The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity. The k-means algorithm proceeds as follows. First, it randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the square-error criterion is used, defined as

E = Σ (i = 1 to k) Σ (p ∈ Ci) |p - mi|²

where E is the sum of the square error for all objects in the data set, p is the point in space representing a given object, and mi is the mean of cluster Ci (both p and mi are multidimensional). In other words, for each object in each cluster, the distance from the object to its cluster center is squared, and the distances are summed. This criterion tries to make the resulting k clusters as compact and as separate as possible. The k-means procedure is summarized in the following figure.

[Figure: clustering of a set of objects by the k-means method; the mean of each cluster is marked with a "+".]

K-means Algorithm

The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.

Input:
k: number of clusters
D: data set containing n objects

Output: a set of k clusters

Algorithm:
1. Arbitrarily choose k objects from D as the initial cluster centers.
2. Repeat:
3. (Re)assign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster.
4. Update the cluster means, i.e., calculate the mean value of the objects for each cluster.
5. Until no change.
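A minimal sketch of this procedure in Python/numpy is given below; the data points and the function name are illustrative assumptions, not part of the manual. The square-error criterion E described above is computed at the end.

import numpy as np

def kmeans(data, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily choose k objects from D as the initial cluster centers
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        # Step 3: (re)assign each object to the most similar (nearest) cluster center
        distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: update the cluster means
        new_centers = np.array([data[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        # Step 5: stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Toy 2-D data, invented for illustration
data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                 [5.0, 5.0], [5.2, 4.8], [4.9, 5.3]])
centers, labels = kmeans(data, k=2)

# Square-error criterion E: sum of squared distances of each object to its cluster center
E = sum(np.sum((data[labels == i] - centers[i]) ** 2) for i in range(2))
print("centers:\n", centers)
print("cluster labels:", labels)
print("square-error criterion E:", E)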

Advantages
□ Easy to implement.
□ An instance can change cluster (move to another cluster) when the centroids are re-computed.
□ If the number of variables is huge, K-Means is most of the time computationally faster than hierarchical clustering, if we keep k small.
□ K-Means produces tighter clusters than hierarchical clustering, especially if the clusters are globular.

Disadvantages
□ Difficult to predict the number of clusters (K-value).
□ Initial seeds have a strong impact on the final results.
□ The order of the data has an impact on the final results.
□ Sensitive to scale: rescaling your dataset (normalization or standardization) will completely change the results. While this in itself is not bad, not realizing that you have to spend extra attention to scaling your data might be bad.
□ With global clusters, it does not work well.
□ Different initial partitions can result in different final ...

