Sample answer of data PE PG (PDF)

Title: Sample answer of data PE PG
Course: Algorithm & Data Structure
Institution: Universiti Teknologi Petronas



Description

FEM 2063 Data Analytics: Extended Assignment
Name: Siva Prasad Raveendran
Matrix ID: 16005134
Lecturer: Assoc. Prof. Dr. Mahmod Othman / Assoc. Prof. Dr. Ibrahima Faye

All code from Google Colab is attached at the end; refer to it for a clearer view. Thank you.

Question 1
Dataset: https://www.kaggle.com/hellbuoy/car-price-prediction

Wheel base VS Price

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.5778156
R Square             0.33387087
Adjusted R Square    0.33058944
Standard Error       6536.28037
Observations         205

ANOVA
             df    SS            MS         F          Significance F
Regression   1     4346878264    4.35E+09   101.7457   1.1828E-19
Residual     203   8672761098    42722961
Total        204   13019639362

                            Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
Intercept                   -62426.648     7518.981688      -8.30254   1.42E-14   -77251.9658   -47601.33
X Variable 1 (Wheel base)   766.565168     75.99604921      10.08691   1.18E-19   616.722325    916.4080106

Table 1: Simple Linear Regression (Wheel base VS Price)
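The regression tables in this answer come from Excel's Analysis ToolPak, but the same summary can be reproduced in Python. A minimal sketch with statsmodels, assuming the Kaggle file is saved as 'CarPrice_Assignment.csv' with columns named 'wheelbase' and 'price' (the filename and column names are assumptions about the dataset, not taken from the original workbook):

# A minimal sketch (not the original workbook): reproduce the Table 1 fit
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('CarPrice_Assignment.csv')  # assumed filename
X = sm.add_constant(df['wheelbase'])         # add_constant supplies the intercept term
slr = sm.OLS(df['price'], X).fit()
print(slr.summary())  # R-squared, ANOVA F-statistic, coefficients, p-values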

[Scatter plot of Price against Wheel base with fitted trendline: y = 766.57x - 62427, R² = 0.3339]

Figure 1: Scatter Plot (Wheel base VS Price)


Curb Weight VS Price

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.8353049
R Square             0.6977342
Adjusted R Square    0.6962452
Standard Error       4402.9721
Observations         205

ANOVA
             df    SS            MS         F          Significance F
Regression   1     9084248194    9.08E+09   468.5944   1.21444E-54
Residual     203   3935391168    19386163
Total        204   13019639362

                             Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
Intercept                    -19475.86      1543.962203      -12.6142   2.58E-27   -22520.1218   -16431.603
X Variable 1 (Curb weight)   12.816173      0.592051908      21.64704   1.21E-54   11.64881265   13.9835325

Table 2: Simple Linear Regression (Curb Weight VS Price)

[Scatter plot of Price against Curb weight with fitted trendline: y = 12.816x - 19476, R² = 0.6977]

Figure 2: Scatter Plot (Curb Weight VS Price)


Engine size VS Price

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.8741448
R Square             0.76412914
Adjusted R Square    0.76296721
Standard Error       3889.45371
Observations         205

ANOVA
             df    SS            MS         F          Significance F
Regression   1     9948685774    9.95E+09   657.6404   1.3548E-65
Residual     203   3070953588    15127850
Total        204   13019639362

                             Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
Intercept                    -8005.4455     873.22075        -9.16772   5.47E-17   -9727.19134   -6283.69972
X Variable 1 (Engine size)   167.698416     6.539351952      25.6445    1.35E-65   154.804653    180.59218

Table 3: Simple Linear Regression (Engine Size VS Price)

[Scatter plot of Price against Engine size with fitted trendline: y = 167.7x - 8005.4, R² = 0.7641]

Figure 3: Scatter Plot (Engine Size VS Price)


Car Width VS Price

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.7593253
R Square             0.5765749
Adjusted R Square    0.5744891
Standard Error       5211.2241
Observations         205

ANOVA
             df    SS            MS         F          Significance F
Regression   1     7506797404    7.51E+09   276.4236   9.6274E-40
Residual     203   5512841958    27156857
Total        204   13019639362

                           Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
Intercept                  -173095.2      11215.58038      -15.4335   4.58E-36   -195209.208   -150981.3
X Variable 1 (Car width)   2827.7675      170.0811173      16.62599   9.63E-40   2492.41533    3163.1196

Table 4: Simple Linear Regression (Car Width VS Price)

[Scatter plot of Price against Car width with fitted trendline: y = 2827.8x - 173095, R² = 0.5766]

Figure 4: Scatter Plot (Car Width VS Price)


Horsepower VS Price

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.8081388
R Square             0.6530884
Adjusted R Square    0.6513794
Standard Error       4716.9459
Observations         205

ANOVA
             df    SS            MS        F          Significance F
Regression   1     8502974873    8.5E+09   382.1634   1.4834E-48
Residual     203   4516664489    22249579
Total        204   13019639362

                            Coefficients   Standard Error   t Stat     P-value    Lower 95%    Upper 95%
Intercept                   -3721.761      929.8492418      -4.00254   8.78E-05   -5555.1628   -1888.36
X Variable 1 (Horsepower)   163.26306      8.351478808      19.549     1.48E-48   146.796293   179.72983

Table 5: Simple Linear Regression (Horsepower VS Price)

[Scatter plot of Price against Horsepower with fitted trendline: y = 163.26x - 3721.8, R² = 0.6531]

Figure 5: Scatter Plot (Horsepower VS Price)


SUMMARY OUTPUT

Regression Statistics
Multiple R           0.902119801
R Square             0.813820135
Adjusted R Square    0.811041331
Standard Error       3472.704297
Observations         205

ANOVA
             df    SS             MS         F          Significance F
Regression   3     10595644660    3.53E+09   292.8671   4.3566E-73
Residual     201   2423994702     12059675
Total        204   13019639362

                             Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
Intercept                    -13463.78869   1333.050588      -10.1      1.17E-19   -16092.3465   -10835.231
X Variable 1 (Curb weight)   4.262551513    0.906525233      4.702077   4.78E-06   2.47503196    6.0500711
X Variable 2 (Engine size)   84.87963009    12.76136888      6.651295   2.69E-10   59.7162971    110.04296
X Variable 3 (Horsepower)    48.74726003    10.696839        4.557165   8.99E-06   27.6548424    69.839678

Table 6: Multiple Linear Regression (Curb Weight / Engine Size / Horsepower VS Price)

Linear regression is a very simple approach to supervised learning and is useful for predicting a quantitative response. It may seem dull compared with more modern learning methods, but it remains useful and widely used. Mathematically, simple linear regression is written as Y ≈ β0 + β1X. For this question, I had to find a dataset with at least 100 observations and at least 5 attributes, so I used a dataset from Kaggle with 205 observations and 17 attributes that describes car prices in terms of the cars' attributes. To begin the analysis, I chose the attributes most likely to affect the price of a car and built a simple regression model of price against each of 'Wheel base', 'Curb weight', 'Engine size', 'Car width', and 'Horsepower'. Mathematically:

price ≈ β0 + β1(wheelbase)
price ≈ β0 + β1(curbweight)
price ≈ β0 + β1(enginesize)
price ≈ β0 + β1(carwidth)
price ≈ β0 + β1(horsepower)

Model specification is the process of deciding which independent variables to include in, and exclude from, a regression equation. The number of independent variables must be chosen carefully: too few can bias the model, while too many can make it less precise. Although I only had to propose 3 independent variables with an effect on the target variable (price), I fitted 5 to see which have the greatest effect. Tables 1-5 report the coefficient of determination for each model; the higher it is, the more of the variation in price the attribute explains, and in general we should prefer models with higher adjusted and predicted R-squared values. For this question I therefore propose 'curbweight', 'enginesize', and 'horsepower' as the independent variables, since their R² values are higher than those of the other two. The scatter plots tell the same story: 'curbweight', 'enginesize', and 'horsepower' show a clear positive correlation with price, while 'carwidth' and 'wheelbase' show a weaker trend. In regression, a p-value below the significance level indicates that the term is statistically significant, i.e. we reject the null hypothesis that its coefficient is zero. Fitting a multiple linear regression of price on these three variables increases the R-squared value (Table 6), which shows that together they explain the price of a car better than any one of them alone. The significance F of the MLR is also extremely small, so it is very unlikely that these three variables have no effect on price.
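For reference, this multiple regression can be reproduced in Python as well. The sketch below uses statsmodels and assumes the same Kaggle file saved as 'CarPrice_Assignment.csv' with columns named 'curbweight', 'enginesize', 'horsepower', and 'price' (filename and column names are assumptions about the dataset, not taken from the original workbook):

# A minimal sketch (not the original workbook): fit the Table 6 MLR
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('CarPrice_Assignment.csv')  # assumed filename
X = sm.add_constant(df[['curbweight', 'enginesize', 'horsepower']])
mlr = sm.OLS(df['price'], X).fit()
print(mlr.rsquared_adj)  # adjusted R-squared (about 0.811 in Table 6)
print(mlr.pvalues)       # per-coefficient p-values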


Question 2
Dataset: https://www.kaggle.com/johndasilva/diabetes

# Part A(i): Logistic Regression

# Import the file (upload it in Colab)
from google.colab import files
uploaded = files.upload()

# Import python packages
import pandas as pd

# Read the file
df = pd.read_csv('Diabetes.csv')

# See what it looks like
print(df)

# Check for missing values
df.isnull().values.any()

# More python packages
import matplotlib.pyplot as plt
import numpy as np

# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Import mean from statistics
from statistics import mean

# Extract the columns of the independent variable and target variable
Colglucose = df['Glucose']
Coloutcome = df['Outcome']


# Extract the first 100 rows of the independent variable and target variable
x = np.array(Colglucose[:100]).reshape(-1, 1)
y = np.array(Coloutcome[:100])

# Mean of the outcome over the first 150 rows, used as the class threshold
ymean = mean(Coloutcome[:150])
print('Mean: ', ymean)
y = np.where(y >= ymean, 1, 0)

# Test data from rows 101 to 150
xtest = np.array(Colglucose[100:150]).reshape(-1, 1)
y2 = np.array(Coloutcome[100:150])
ytest = np.where(y2 >= ymean, 1, 0)

# Define the model
model = LogisticRegression()

# Fit the (training) data
model.fit(x, y)
beta0 = model.intercept_
beta1 = model.coef_
print('Coefficient: ', beta1)
print('Intercept: ', beta0)

# Predict the test data and report the accuracy
ypred = model.predict(xtest)
print('Prediction: ', ypred)
print('Accuracy: ', model.score(xtest, ytest))
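The fitted model estimates P(outcome = 1) = 1 / (1 + e^-(β0 + β1·glucose)), so the intercept and coefficient printed above can be turned into a probability for any glucose value. A short sketch continuing from the cell above (the glucose value 120 is an illustrative choice, not from the assignment):

# Sketch: turn the fitted coefficients into a predicted probability
# (uses np, beta0, beta1 and model from the cell above)
glucose = 120  # hypothetical glucose reading
p = 1 / (1 + np.exp(-(beta0 + beta1 * glucose)))
print('P(diabetic): ', p)

# The same value via scikit-learn; predict_proba returns [P(0), P(1)]
print(model.predict_proba([[glucose]]))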


# Part A(ii): Naive Bayes

# Import the file (upload it in Colab)
from google.colab import files
uploaded = files.upload()

# Import python packages
import pandas as pd

# Read the file
df = pd.read_csv('Diabetes.csv')

# See what it looks like
print(df)

# Check for missing values
df.isnull().values.any()

# More imports
import matplotlib.pyplot as plt
import numpy as np

# Import Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB

# Import mean from statistics
from statistics import mean


# Extract the columns of the independent variable and target variable
Colglucose = df['Glucose']
Coloutcome = df['Outcome']

# Extract the first 100 rows for training
x = np.array(Colglucose[:100]).reshape(-1, 1)
y = np.array(Coloutcome[:100])

# Mean of the outcome over the first 150 rows, used as the class threshold
ymean = mean(Coloutcome[:150])
print('Mean: ', ymean)
y = np.where(y >= ymean, 1, 0)

# Test data from rows 101 to 150
xtest = np.array(Colglucose[100:150]).reshape(-1, 1)
y2 = np.array(Coloutcome[100:150])
ytest = np.where(y2 >= ymean, 1, 0)

# Define the model
model = GaussianNB()

# Fit the (training) data
model.fit(x, y)

# Predict the test data and report the accuracy
ypred = model.predict(xtest)
print('Prediction: ', ypred)
print('Accuracy: ', model.score(xtest, ytest))
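GaussianNB works by fitting a normal distribution to glucose within each outcome class and applying Bayes' rule, so the learned per-class parameters can be inspected directly. A small sketch continuing from the cell above (attribute names as in recent scikit-learn; older versions expose sigma_ instead of var_):

# Sketch: inspect the Gaussian fitted to glucose for each class
print('Class priors:    ', model.class_prior_)
print('Class means:     ', model.theta_)  # mean glucose per class
print('Class variances: ', model.var_)    # per-class variance (sigma_ in older sklearn)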


# Part B(i): KNN

# Import python packages
import numpy as np
import pandas as pd

# Import the packages needed for KNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import itertools

# Import the file (upload it in Colab)
from google.colab import files
uploaded = files.upload()

# Read the file
dataset = pd.read_csv('Diabetes.csv')

# See what it looks like
print('Total samples in the dataset: ', dataset.shape[0])
print('Dataset features are: ', list(dataset.columns[0:9]))

# Check for missing values
dataset.isnull().values.any()

# Pairplot
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.figure()

# Set up a marker generator: one marker per outcome class
mks = itertools.cycle(['o', '^', '*', 'v', 's', 'p'])
markers = [next(mks) for i in dataset['Outcome'].unique()]
sns.pairplot(dataset.drop('Glucose', axis=1), hue='Outcome', height=3, markers=markers)
plt.show()

# Extract the independent variables and target variable; this manual
# 70-row split is superseded by train_test_split in the next cell
x = dataset[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']].values
y = dataset['Outcome'].values
xtrain = x[:70]
ytrain = y[:70]
xtest = x[70:]
ytest = y[70:]


# Split the dataset into training set and test set
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.5, random_state=0)  # 50% training and 50% test
# random_state initializes the internal random number generator,
# which decides the splitting of data into train and test indices.
# If it is not fixed to a value, the split can differ on each run.

# For many machine learning algorithms it is important to scale the data,
# so standardize the training set and test set
sc = StandardScaler()
sc.fit(xtrain)
x_train_std = sc.transform(xtrain)
x_test_std = sc.transform(xtest)
x_combined_std = np.vstack((x_train_std, x_test_std))
y_combined = np.hstack((ytrain, ytest))

k = 3
knn = KNeighborsClassifier(n_neighbors=k, p=2, metric='minkowski')
knn.fit(x_train_std, ytrain)

# Model accuracy on training data and test data
print('The accuracy of the knn classifier out of 1 is {:.2f} on training data'.format(knn.score(x_train_std, ytrain)))
print('The accuracy of the knn classifier out of 1 is {:.2f} on test data'.format(knn.score(x_test_std, ytest)))
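k = 3 is an arbitrary choice here. In the same spirit as the grid search used for the decision tree below, k could instead be chosen by cross-validation; a hedged sketch using the standardized training data from the cell above:

# Sketch: pick k by cross-validation instead of fixing k = 3
from sklearn.model_selection import cross_val_score

for k in range(1, 16, 2):  # odd k avoids ties in the majority vote
    knn_k = KNeighborsClassifier(n_neighbors=k, p=2, metric='minkowski')
    scores = cross_val_score(knn_k, x_train_std, ytrain, cv=5)
    print('k =', k, 'mean CV accuracy =', scores.mean())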


# Part B(ii): Decision Tree Classification
import numpy as np
import pandas as pd

# Import the Decision Tree classifier
from sklearn.tree import DecisionTreeClassifier

# Import the train_test_split function (to split the dataset into training and testing)
from sklearn.model_selection import train_test_split

# Import the scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Download the dataset to your computer and use the following to upload it in Colab
from google.colab import files
uploaded = files.upload()

# Read the file
df = pd.read_csv('Diabetes.csv')

# See what it looks like
print(df)

# Check for missing values
df.isnull().values.any()

# Define the predictor and target variables
x = df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']]
y = df['Outcome']


# Split the dataset into training set and test set
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.5, random_state=0)  # 50% training and 50% test
# random_state initializes the internal random number generator,
# which decides the splitting of data into train and test indices.
# If it is not fixed to a value, the split can differ on each run.

# Create a Decision Tree classifier object (unpruned by default)
clf = DecisionTreeClassifier()

# Train the Decision Tree classifier
clf = clf.fit(xtrain, ytrain)

# Define a vector containing the selected predictors
attributes = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

# Set the size of the output figure and plot the tree
import matplotlib.pyplot as plt
plt.figure(figsize=(50, 50))
from sklearn import tree
figure = tree.plot_tree(clf, filled=True, rounded=True, feature_names=attributes, class_names=['0', '1'])

# Predict the response for the test dataset
ypred = clf.predict(xtest)

# Model accuracy
print("Accuracy:", metrics.accuracy_score(ytest, ypred))




# Use grid search to find the best parameters (for pruning)
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Define the parameters to optimize
grid_param = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 3, 4, 5, 6, 7, 10, 11]}

gd_sr = GridSearchCV(estimator=clf, param_grid=grid_param, scoring='accuracy', cv=6, n_jobs=-1)
gd_sr.fit(xtrain, ytrain)
best_parameters = gd_sr.best_params_
print(best_parameters)

# Best score on the validation folds
best_result = gd_sr.best_score_
print(best_result)

# Create a Decision Tree classifier object based on the optimized parameters
clf = DecisionTreeClassifier(criterion="gini", max_depth=2)

# Train the Decision Tree classifier
clf = clf.fit(xtrain, ytrain)

# Define a vector containing the selected predictors
attributes = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

# Set the size of the output figure and plot the pruned tree
import matplotlib.pyplot as plt
plt.figure(figsize=(25, 15))
from sklearn import tree
figure = tree.plot_tree(clf, filled=True, rounded=True, feature_names=attributes, class_names=['0', '1'])


# Predict the response for the test dataset with the pruned tree
ypred = clf.predict(xtest)

# Model accuracy
print("Accuracy:", metrics.accuracy_score(ytest, ypred))

