BE (Computer) (Semester I)

Data Analytics Laboratory (2018-2019)




LAB MANUAL Academic year 2018-19

B.E. COMPUTER (SEM – I) DATA ANALYTICS LAB (Laboratory Practice I )

Subject Incharge




Department of Computer Engineering



Study of Iris flower Data Set


Naive Bayes' classification


Trip History Analysis


Bigmart Sales Analysis




Assignment No: 1
Title: Study of Iris Flower Data Set

Problem Statement: Download the Iris flower dataset or any other dataset into a DataFrame. (eg ) Use Python/R and perform the following:
• How many features are there and what are their types (e.g., numeric, nominal)?
• Compute and display summary statistics for each feature available in the dataset (e.g., minimum value, maximum value, mean, range, standard deviation, variance and percentiles).
• Data Visualization: Create a histogram for each feature in the dataset to illustrate the feature distributions. Plot each histogram.
• Create a boxplot for each feature in the dataset. All of the boxplots should be combined into a single plot. Compare distributions and identify outliers.

Theory: The Iris data set is perhaps the best-known database in the pattern recognition literature. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. Predicted attribute: class of iris plant. This is an exceedingly simple domain.

This data differs from the data presented in Fisher's article. The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa", where the error is in the fourth feature. The 38th sample should be: 4.9,3.6,1.4,0.1,"Iris-setosa", where the errors are in the second and third features.

Attribute Information:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
   -- Iris Setosa
   -- Iris Versicolour
   -- Iris Virginica

RMDSSOE, Warje


The extra modules needed for coding

• Pandas: Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools.

For Fedora users:
    sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy python-nose atlas-devel
For Ubuntu users:
    sudo apt-get install python3-pandas

Pandas deals with the following three data structures:
- Series
- DataFrame
- Panel (deprecated, and removed from recent pandas releases)

DataFrame is a two-dimensional array with heterogeneous data. For example:

Name    Age   Gender   Rating
Steve   32    Male     3.45
Lia     28    Female   4.6
Vin     45    Male     3.9
Katie   38    Female   2.78

The table represents the data of a sales team of an organization with their overall performance rating. The data is represented in rows and columns. Each column represents an attribute and each row represents a person.

pandas.DataFrame
A pandas DataFrame can be created using the following constructor:

    pandas.DataFrame(data, index, columns, dtype, copy)



S.No   Parameter   Description
1      data        The data can take various forms: ndarray, Series, map, list, dict, constants, or another DataFrame.
2      index       Row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
3      columns     Column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
4      dtype       Data type of each column.
5      copy        Whether to copy the input data. The default is False.
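The constructor parameters can be tried on the sales-team table shown earlier. The following is a minimal sketch, assuming pandas is installed; the row labels r1 to r4 are hypothetical and chosen only to illustrate the index parameter.

```python
import pandas as pd

# Rows of the sales-team table from the manual
rows = [
    ["Steve", 32, "Male", 3.45],
    ["Lia", 28, "Female", 4.6],
    ["Vin", 45, "Male", 3.9],
    ["Katie", 38, "Female", 2.78],
]

# `data` holds the values, `columns` names the attributes,
# and `index` supplies row labels (r1..r4 are made up for this sketch)
df = pd.DataFrame(
    data=rows,
    index=["r1", "r2", "r3", "r4"],
    columns=["Name", "Age", "Gender", "Rating"],
)

print(df)
print(df.dtypes)           # pandas infers a type for each column
print(df.loc["r2", "Age"])  # look up one cell by row label and column name
```

If index and columns are omitted, pandas falls back to integer labels starting at 0 for both.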

Example 1
The following example shows how to create a DataFrame by passing a list of dictionaries.

    import pandas as pd

    # Each dictionary is one row; the keys become column names
    data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
    df = pd.DataFrame(data)
    print(df)

Its output is as follows:

       a   b     c
    0  1   2   NaN
    1  5  10  20.0

• matplotlib
- matplotlib.pyplot is a collection of command-style functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.
- In matplotlib.pyplot, various states are preserved across function calls, so that it keeps track of things like the current figure and plotting area, and the plotting functions are directed to the current axes (please note that "axes" here and in most places in the documentation refers to the axes part of a figure and not the strict mathematical term for more than one axis).



Example 1
Generating visualizations with pyplot is very quick:

    import matplotlib.pyplot as plt

    plt.plot([1, 2, 3, 4])        # y values; x defaults to 0, 1, 2, 3
    plt.ylabel('some numbers')
    plt.show()                    # display the figure

• Sklearn
- Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python.
- It is built on NumPy, SciPy and matplotlib, and contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.

Loading Your Data Set
- The first step in almost any data science task is loading your data. This is also the starting point for working with scikit-learn.
- To load the data, import the module datasets from sklearn. Then use the load_digits() function from datasets to load the data:

Example 1:

    # Import `datasets` from `sklearn`
    from sklearn import datasets

    # Load in the `digits` data
    digits = datasets.load_digits()

    # Print the `digits` data
    print(digits)

• How to install?



- sudo apt-get install python3-pandas
- sudo apt-get install python3-matplotlib
- sudo apt-get install python3-sklearn

• How to Find the Mean, Median, Mode, Range, and Standard Deviation

Simplify comparisons of sets of numbers, especially large sets, by calculating the center values using the mean, mode and median. Use the ranges and standard deviations of the sets to examine the variability of the data.

Calculating Mean
The mean identifies the average value of the set of numbers. For example, consider the data set containing the values 20, 24, 25, 36, 25, 22, 23.

Formula
To find the mean, use the formula: Mean equals the sum of the numbers in the data set divided by the number of values in the data set. In mathematical terms: Mean = (sum of all terms) ÷ (number of terms or values in the set).

Adding Data Set
Add the numbers in the example data set: 20+24+25+36+25+22+23=175.

Finding Divisor
Divide by the number of data points in the set. This set has seven values, so divide by 7.

Finding Mean
Insert the values into the formula to calculate the mean. The mean equals the sum of the values (175) divided by the number of data points (7). Since 175÷7=25, the mean of this data set equals 25. Not all mean values will equal a whole number.

Calculating Range
Range shows the mathematical distance between the lowest and highest values in the data set. Range measures the variability of the data set. A wide range indicates greater variability in the data, or perhaps a single outlier far from the rest of the data. Outliers may skew, or shift, the mean value enough to impact data analysis.

Identifying Low and High Values
In the sample group, the lowest value is 20 and the highest value is 36.

Calculating Range
To calculate range, subtract the lowest value from the highest value. Since 36-20=16, the range equals 16.
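The mean and range steps above can be checked with a few lines of Python. This is a small sketch using only the standard library; the median and mode are included because the section heading mentions them, although the manual does not work them out by hand.

```python
import statistics

# Example data set from the manual
data = [20, 24, 25, 36, 25, 22, 23]

total = sum(data)                      # 20+24+25+36+25+22+23 = 175
mean = total / len(data)               # 175 / 7 = 25.0
median = statistics.median(data)       # middle value of the sorted data
mode = statistics.mode(data)           # most frequent value
value_range = max(data) - min(data)    # 36 - 20 = 16

print("sum:", total)
print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("range:", value_range)
```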



Calculating Standard Deviation
Standard deviation measures the variability of the data set. Like range, a smaller standard deviation indicates less variability.

Formula
Finding the standard deviation requires squaring the difference between each data point and the mean, summing the squares [∑(x-µ)^2], dividing that sum by one less than the number of values (N-1), and finally taking the square root of the quotient. Start by calculating the mean.

Calculating the Mean
Calculate the mean by adding all the data point values, then dividing by the number of data points. In the sample data set, 20+24+25+36+25+22+23=175. Divide the sum, 175, by the number of data points, 7, or 175÷7=25. The mean equals 25.

Squaring the Difference
Next, subtract the mean from each data point, then square each difference. The formula looks like this: ∑(x-µ)^2, where ∑ means sum, x represents each data set value and µ represents the mean value. Continuing with the example set, the values become: 20-25=-5 and (-5)^2=25; 24-25=-1 and (-1)^2=1; 25-25=0 and 0^2=0; 36-25=11 and 11^2=121; 25-25=0 and 0^2=0; 22-25=-3 and (-3)^2=9; and 23-25=-2 and (-2)^2=4.

Adding the Squared Differences
25+1+0+121+0+9+4=160.

Division by N-1
Divide the sum of the squared differences by one less than the number of data points. The example data set has 7 values, so N-1 equals 7-1=6. The sum of the squared differences, 160, divided by 6 equals approximately 26.6667.

Standard Deviation
Calculate the standard deviation by finding the square root of the result of the division by N-1. In the example, the square root of 26.6667 equals approximately 5.164. Therefore, the standard deviation equals approximately 5.164.

Evaluating Standard Deviation
Standard deviation helps evaluate data. Values that fall within one standard deviation of the mean are typical of the data set. Values that fall more than two standard deviations from the mean are extreme values or outliers. In the example set, the value 36 lies more than two standard deviations from the mean (36-25=11, which exceeds 2×5.164=10.328), so 36 is an outlier. Outliers may represent erroneous data or may suggest unforeseen circumstances and should be carefully considered when interpreting data.
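The standard deviation steps can be reproduced as follows. This sketch uses only the Python standard library and follows the N-1 (sample) formula from the text; the cross-check against statistics.stdev is an addition for verification, not part of the manual's procedure.

```python
import math
import statistics

# Example data set from the manual
data = [20, 24, 25, 36, 25, 22, 23]

mean = sum(data) / len(data)                       # 25.0
squared_diffs = [(x - mean) ** 2 for x in data]    # 25, 1, 0, 121, 0, 9, 4
sum_sq = sum(squared_diffs)                        # 160
sample_variance = sum_sq / (len(data) - 1)         # 160 / 6
std_dev = math.sqrt(sample_variance)

# statistics.stdev also uses the N-1 denominator, so it should agree
assert math.isclose(std_dev, statistics.stdev(data))

# Flag values more than two standard deviations away from the mean
outliers = [x for x in data if abs(x - mean) > 2 * std_dev]

print("variance:", round(sample_variance, 4))
print("standard deviation:", round(std_dev, 3))
print("outliers:", outliers)
```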

The Data Set

Import packages
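One possible way to carry out the assignment steps is sketched below. It is not the manual's own listing: it assumes pandas, matplotlib and scikit-learn are installed, and it loads the copy of the Iris data bundled with scikit-learn instead of a downloaded CSV file. The column names are the ones scikit-learn provides, the added species column is a convenience of this sketch, and the 1.5 × IQR rule used to count outliers is a common convention (the same one boxplot whiskers use by default), not something the manual specifies.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove for interactive use
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the bundled Iris data as a pandas DataFrame (4 features + target)
iris = load_iris(as_frame=True)
df = iris.frame.copy()
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))
features = iris.feature_names

# Number of features and their types
print(df.dtypes)

# Summary statistics: min, max, mean, std, percentiles, plus range and variance
summary = df[features].describe(percentiles=[0.25, 0.5, 0.75]).T
summary["range"] = summary["max"] - summary["min"]
summary["variance"] = df[features].var()
print(summary)

# One histogram per feature
df[features].hist(bins=20, figsize=(8, 6))
plt.suptitle("Iris feature distributions")

# All boxplots combined into a single plot
fig, ax = plt.subplots(figsize=(8, 4))
df[features].boxplot(ax=ax)
ax.set_title("Iris features")

# Count values outside the 1.5 * IQR whiskers for each feature
q1 = df[features].quantile(0.25)
q3 = df[features].quantile(0.75)
iqr = q3 - q1
outlier_mask = (df[features] < q1 - 1.5 * iqr) | (df[features] > q3 + 1.5 * iqr)
print(outlier_mask.sum())
```

In an interactive session, remove the matplotlib.use("Agg") line and call plt.show() at the end to display the figures.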

Sample Output



Conclusion: Thus we have learnt and implemented summary statistics, histogram visualization and box plots for each feature. We also compared distributions and identified outliers.



Assignment No: 2
Title: Naive Bayes' Algorithm for classification

Problem Statement: Download the Pima Indians Diabetes dataset. Use the Naive Bayes' Algorithm for classification.
- Load the data from CSV file and split it into training and test datasets.
- Summarize the properties in the training dataset so that we can calculate probabilities and make predictions.
- Classify samples from a test dataset and a summarized training dataset.

Theory: Implement a classification algorithm, namely Naïve Bayes. Implement the following operations:
1. Split the dataset into Training and Test dataset.
2. Calculate conditional probability of each feature in training dataset.
3. Classify sample from a test dataset.
4. Display confusion matrix with predicted and actual values.

Dataset
The dataset includes data from 768 women with 8 characteristics, in particular:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)

The last column of the dataset indicates if the person has been diagnosed with diabetes (1) or not (0).

The Problem
This is a classic supervised binary classification problem. Given a number of elements all with certain characteristics (features), we want to build a machine learning model to identify people affected by type 2 diabetes.


To solve the problem we will have to analyze the data, do any required transformation and normalization, apply a machine learning algorithm, train a model, check the performance of the trained model and iterate with other algorithms until we find the one that performs best for our type of dataset.

What is Naive Bayes algorithm?
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features, given the class. A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, Naive Bayes can perform competitively with more sophisticated classification methods on some problems.

Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c):

    P(c|x) = P(x|c) * P(c) / P(x)

Above,
• P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
• P(c) is the prior probability of class.
• P(x|c) is the likelihood, which is the probability of predictor given class.
• P(x) is the prior probability of predictor.

How Naive Bayes algorithm works?
Let's understand it using an example. Consider a training data set of weather conditions and the corresponding target variable 'Play' (indicating whether play took place). We need to classify whether players will play or not based on the weather condition, using the following steps.

Step 1: Convert the data set into a frequency table.

Step 2: Create a likelihood table by finding probabilities such as P(Overcast) = 0.29 and P(Yes) = 0.64 (the probability of playing).



Step 3: Now, use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.

Problem: Players will play if the weather is sunny. Is this statement correct?
We can solve it using the method of posterior probability discussed above.

    P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36 and P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60. Since this is higher than P(No | Sunny) = 1 - 0.60 = 0.40, the prediction is that players will play.

Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.

The algorithm is categorized into the following steps:
1. Handle Data: Load the data from CSV file and split it into training and test datasets.
2. Summarize Data: Summarize the properties in the training dataset so that we can calculate probabilities and make predictions.
3. Make a Prediction: Use the summaries of the dataset to generate a single prediction.
4. Make Predictions: Generate predictions given a test dataset and a summarized training dataset.
5. Evaluate Accuracy: Evaluate the accuracy of predictions made for a test dataset as the percentage correct out of all predictions made.
6. Tie it Together: Use all of the code elements to present a complete and standalone implementation of the Naive Bayes algorithm.
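The posterior calculation for the weather example can be verified with exact fractions, which avoids the rounding in 0.33 * 0.64 / 0.36. The sketch below uses only the counts quoted in the text (3 of 9, 9 of 14 and 5 of 14); the variable names are illustrative.

```python
from fractions import Fraction

# Probabilities quoted in the worked example
p_sunny_given_yes = Fraction(3, 9)   # P(Sunny | Yes)
p_yes = Fraction(9, 14)              # P(Yes)
p_sunny = Fraction(5, 14)            # P(Sunny)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = 1 - p_yes_given_sunny

# The class with the higher posterior probability is the prediction
prediction = "Yes" if p_yes_given_sunny > p_no_given_sunny else "No"

print("P(Yes | Sunny) =", float(p_yes_given_sunny))
print("P(No | Sunny)  =", float(p_no_given_sunny))
print("Prediction:", prediction)
```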



Applications:
• Real-time Prediction: Naive Bayes is an eager learning classifier and it is fast. Thus, it can be used for making predictions in real time.
• Multi-class Prediction: This algorithm is also well known for its multi-class prediction capability. It can predict the probability of multiple classes of the target variable.
• Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are widely used in text classification, where they often perform well (due to good results in multi-class problems and the independence assumption). As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments).
• Recommendation System: A Naive Bayes classifier and collaborative filtering together can build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.

Input: Structured Dataset: PimaIndiansDiabetes
Dataset File: PimaIndiansDiabetes.csv

Output:
1. Dataset split according to the split ratio.
2. Conditional probability of each feature.
3. Visualization of the performance of the algorithm with a confusion matrix.
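The three outputs listed above can be produced with a short from-scratch implementation. The sketch below is one possible approach, not the manual's reference solution. It assumes a Gaussian likelihood for each numeric feature (the manual does not name a distribution), and it runs on a small made-up two-feature data set instead of PimaIndiansDiabetes.csv so that it is self-contained. All function names and data values are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def split_dataset(rows, split_ratio, seed=1):
    """Shuffle the rows and return (train, test) according to split_ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * split_ratio)
    return shuffled[:cut], shuffled[cut:]

def summarize_by_class(train):
    """Return {label: (prior, [(mean, std) for each feature])}."""
    grouped = defaultdict(list)
    for *features, label in train:
        grouped[label].append(features)
    summaries = {}
    for label, items in grouped.items():
        stats = []
        for column in zip(*items):
            mean = sum(column) / len(column)
            variance = sum((v - mean) ** 2 for v in column) / (len(column) - 1)
            stats.append((mean, math.sqrt(variance)))
        summaries[label] = (len(items) / len(train), stats)
    return summaries

def log_gaussian(x, mean, std):
    """Log of the Gaussian density; assumes std > 0."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def predict(summaries, features):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label, (prior, stats) in summaries.items():
        score = math.log(prior)
        for x, (mean, std) in zip(features, stats):
            score += log_gaussian(x, mean, std)
        scores[label] = score
    return max(scores, key=scores.get)

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))

# Made-up data: two numeric features and a 0/1 label (NOT the Pima dataset)
rng = random.Random(42)
rows = [(rng.gauss(110, 15), rng.gauss(28, 4), 0) for _ in range(40)]
rows += [(rng.gauss(150, 15), rng.gauss(35, 4), 1) for _ in range(40)]

train, test = split_dataset(rows, split_ratio=0.67)
summaries = summarize_by_class(train)

actual = [row[-1] for row in test]
predicted = [predict(summaries, row[:-1]) for row in test]
matrix = confusion_matrix(actual, predicted)
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(test)

print("train/test sizes:", len(train), len(test))
for a in (0, 1):
    print("actual", a, "->", [matrix[(a, p)] for p in (0, 1)])
print("accuracy:", round(accuracy, 3))
```

To run this on the real data, the made-up rows would be replaced by rows read from the CSV file with the class label in the last column. Log probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow when there are many features.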


Conclusion: Hence, we have studied a classification algorithm, namely Naïve Bayes classification.

Questions:
1. What is Bayes Theorem?
2. What is confusion matrix?
3. Which function is used to split the dataset in R?
4. What are steps of Naïve Bayes algorithm?
5. What is conditional probability?



Assignment No: 3 Title: Trip History Analysis

Problem Statement: Use trip history dataset that is from a bike sharing service in the United States. The data is provided quarter-wise from 2010 (Q4) onwards. Each file has 7 columns. Pred...

