Assignment 1, Practical data Science with Python PDF

Title	Assignment 1, Practical data Science with Python
Course	Practical Database Concepts
Institution	Royal Melbourne Institute of Technology
Pages	6


Assessment 1: Data cleaning and Summarising

Task 1 - Data Preparation

Introduction

This task consists of a file titled “Automobile.csv”. The Automobile dataset describes 26 characteristics of a car. The first task is to clean the data using the Python programming language. The process begins by importing the dataset into the Python environment through a Python package.

Data Preparation Steps:

1. Data Retrieving

Initially I loaded the CSV file into the Jupyter notebook. After loading the file, I noticed that it does not contain a header row, so I added appropriate headers. To learn about the distribution of data in each column, I examined the statistical summary of each column. The summary helps reveal data problems such as typos, extra whitespace, impossible values, and missing values.
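The retrieval step can be sketched as follows. This is a minimal illustration and not the notebook's actual code: it reads a small in-memory stand-in for “Automobile.csv”, and the three column names are assumed for the example (the real file has 26 columns).

```python
import io

import pandas as pd

# Hypothetical three-column stand-in for the header-less "Automobile.csv".
raw = io.StringIO("3,alfa-romero,13495\n1,audi,13950\n2,volvo,16845\n")

# Assumed column names; the real dataset needs all 26 names listed here.
column_names = ["Symboling", "Make", "Price"]

# header=None stops pandas from treating the first data row as a header.
Automobile = pd.read_csv(raw, header=None, names=column_names)

# Statistical summary of every column, numerical and categorical alike.
summary = Automobile.describe(include="all")
print(summary)
```

With the real file, the in-memory buffer would be replaced by the path to the CSV.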

2. Data type

I applied the code “Automobile.dtypes” to identify the data types. The file contains 238 rows and 26 columns. The data types are float, integer and object.
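The “dtypes” attribute lists only the column types, while the row and column counts are available from “shape”. The report does not say which call produced the counts, so the following is an assumed sketch on a hypothetical toy frame rather than the notebook's code.

```python
import pandas as pd

# Hypothetical toy frame with one integer, one float and one text column.
Automobile = pd.DataFrame({
    "Symboling": [3, 1, 2],
    "Bore": [3.47, 3.19, 3.78],
    "Make": ["audi", "bmw", "volvo"],
})

# One data type per column.
print(Automobile.dtypes)

# shape is a (rows, columns) tuple.
n_rows, n_cols = Automobile.shape

# Names of the non-numerical (text) columns, which need typo checks later.
text_columns = Automobile.select_dtypes(exclude="number").columns.tolist()
print(n_rows, n_cols, text_columns)
```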

3. Typos

Checking the data types showed that ten of the 26 columns have the object (string) data type. I applied the following code to the Make column and noticed some typos in the car makes:

Automobile['Make'].value_counts()

I fixed the typos by using replace calls. Below is an example of fixing typos for the “make” column:

Automobile['Make'].replace('vol00112ov', 'volvo', inplace=True)
Automobile['Make'].replace('peugot', 'peugeot', inplace=True)
Automobile['Make'].replace('alfa-romero', 'alfa-romeo', inplace=True)

The “make” column contained three typos: “Volvo”, “Peugeot” and “Alfa Romeo” were incorrectly written as “vol00112ov”, “peugot” and “alfa-romero”. To fix any other typos, I applied the same code to the other object-type columns, including Fuel-type, Aspiration, Num-of-doors, Body-style, Drive-wheels, Engine-location, Engine-type and Num-of-cylinders.
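The typo fixes above can also be expressed with a single mapping. This is a sketch on a hypothetical five-row column, not the notebook's code: the dictionary form and the assignment back to the column are substitutions for the three separate “inplace=True” calls, which recent pandas versions warn about when applied to a selected column.

```python
import pandas as pd

# Hypothetical sample of the Make column containing the three reported typos.
Automobile = pd.DataFrame(
    {"Make": ["vol00112ov", "volvo", "peugot", "alfa-romero", "audi"]}
)

# Inspect the distinct values first; misspellings show up as rare categories.
print(Automobile["Make"].value_counts())

# Map each misspelling named in the report to its corrected form.
make_fixes = {
    "vol00112ov": "volvo",
    "peugot": "peugeot",
    "alfa-romero": "alfa-romeo",
}

# Assign the result back instead of modifying the selected column in place.
Automobile["Make"] = Automobile["Make"].replace(make_fixes)
print(Automobile["Make"].tolist())
```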

4. Extra-whitespaces/ Upper/Lower-case

I wrote one line of code that removes extra whitespace and converts all cells of an object-type column to upper case. Below is the code that was applied:

“Automobile["X"] = Automobile["X"].str.strip().str.upper()” (see footnote 1)

To remove any further extra whitespace, I applied the same code to the other object-type columns, including Fuel-type, Aspiration, Num-of-doors, Body-style, Drive-wheels, Engine-location, Engine-type, and Num-of-cylinders.
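Rather than repeating the line once per column, the same clean-up can be applied in a loop. The following is a minimal sketch under assumptions: the toy values and the two-column list are hypothetical, and the real list would hold all text columns named in the report.

```python
import pandas as pd

# Hypothetical sample with stray whitespace and mixed case.
Automobile = pd.DataFrame({
    "Fuel-type": [" gas", "Gas ", "diesel"],
    "Aspiration": ["std ", " Turbo", "STD"],
    "Price": [13495, 16500, 13950],
})

# Assumed subset of the text columns to clean.
text_columns = ["Fuel-type", "Aspiration"]

for col in text_columns:
    # Strip leading/trailing whitespace, then convert to upper case.
    Automobile[col] = Automobile[col].str.strip().str.upper()

print(Automobile)
```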

5. Sanity checks

For the sanity checks, I first applied the following code to observe the content of each column: Automobile['X'].value_counts(). I then looked for impossible values in each attribute. One example in this dataset is Symboling, whose value needs to be between -3 and +3. I observed a value of +4 in the dataset and replaced it with -3. In addition, ‘Engine-type’ contains the lowercase letter “l”, which is not a known engine type. I replaced it with “ohc”, the most frequent value. The same applied to Num-of-cylinders, which contained the values five, twelve, two and three. I replaced all of them with four, the most frequent value.
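The two sanity checks described above might look like the sketch below. It runs on hypothetical toy values, and the use of “between” and “mode” is an assumption about how such a check could be written, not the notebook's code.

```python
import pandas as pd

# Hypothetical sample containing one impossible Symboling value
# and one unknown engine type.
Automobile = pd.DataFrame({
    "Symboling": [3, 4, -1, 0, 2],
    "Engine-type": ["ohc", "l", "ohc", "dohc", "ohc"],
})

# Flag Symboling values outside the valid range of -3 to +3 (inclusive).
out_of_range = ~Automobile["Symboling"].between(-3, 3)
print(Automobile.loc[out_of_range, "Symboling"])

# The report replaces the impossible value +4 with -3.
Automobile.loc[out_of_range, "Symboling"] = -3

# Replace the unknown engine type "l" with the most frequent value.
most_common = Automobile["Engine-type"].mode()[0]
Automobile["Engine-type"] = Automobile["Engine-type"].replace("l", most_common)
print(Automobile)
```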

6. Missing Values

Here I was dealing with data that was both numerical and categorical.

Numerical data

Missing values in numerical data are replaced with the mean or median of the values in those columns. For example, I used the mean to replace the missing values in the following columns: Normalised-losses, Bore, Stroke, Horsepower, Peak-rpm, and Price. This is done by calculating the mean of each numerical column after excluding the non-numerical data and then replacing the missing data with that mean.
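Mean imputation for the numerical columns can be sketched as follows. The toy values are hypothetical, and the “?” placeholder for a non-numerical entry is an assumption made for the example; the report does not state how missing entries were encoded.

```python
import pandas as pd

# Hypothetical sample: one non-numerical entry and one missing price.
Automobile = pd.DataFrame({
    "Horsepower": ["111", "?", "154", "102"],
    "Price": [13495.0, 16500.0, None, 13950.0],
})

numeric_columns = ["Horsepower", "Price"]

for col in numeric_columns:
    # Coerce non-numerical entries to NaN so they are excluded from the mean.
    Automobile[col] = pd.to_numeric(Automobile[col], errors="coerce")
    # Series.mean() skips NaN by default; fill the gaps with that mean.
    Automobile[col] = Automobile[col].fillna(Automobile[col].mean())

print(Automobile)
```

Replacing “mean()” with “median()” gives the median variant mentioned above, which is less sensitive to extreme values.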

Categorical data

Missing values in categorical data are treated using the mode, which is the value that appears most frequently in a dataset. For example, missing values in the number of doors (Num-of-doors) are replaced with the mode. This is explained in the Sanity checks section.
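A minimal sketch of mode imputation for a categorical column, using hypothetical values for Num-of-doors:

```python
import pandas as pd

# Hypothetical sample with two missing entries.
Automobile = pd.DataFrame(
    {"Num-of-doors": ["four", "two", None, "four", "four", None]}
)

# mode() ignores missing values and returns a Series; take the first entry.
doors_mode = Automobile["Num-of-doors"].mode()[0]

# Fill the missing entries with the most frequent value.
Automobile["Num-of-doors"] = Automobile["Num-of-doors"].fillna(doors_mode)
print(Automobile["Num-of-doors"].tolist())
```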

Detecting, Analysing, and treating outliers

The normal range for Symboling in this dataset should be between -3 and 3. However, the dataset contains the value 4, which is an example of an outlier. This outlier is replaced by -3. Another outlier was a car price of zero (0), which is replaced by the mean of the values.
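The zero-price treatment can be sketched as below on hypothetical prices. The report does not say whether the zeros were included when the mean was calculated; this example assumes they were excluded.

```python
import pandas as pd

# Hypothetical prices with one impossible zero.
Automobile = pd.DataFrame({"Price": [13495.0, 0.0, 16500.0, 13950.0]})

# Boolean mask marking the zero prices.
zero_price = Automobile["Price"] == 0

# Assumption: the mean is computed from the non-zero prices only.
mean_price = Automobile.loc[~zero_price, "Price"].mean()

# Overwrite the zero prices with that mean.
Automobile.loc[zero_price, "Price"] = mean_price
print(Automobile)
```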

Footnote 1: X stands for the column name.

Task 2 - Data Exploration

Column with nominal values

Figure 1: Frequency of the car prices

Figure 1 shows the frequency distribution of car prices in the dataset. The histogram is right-skewed: most car prices fall between $5,000 and $9,000, with a long tail of more expensive cars. The lowest car price in the dataset is $5,000 and the most expensive car costs around $45,000. The average price of the cars in the sample is around $13,000. Price is chosen here because it is a nominal variable.

Column with ordinal Values

Figure 2: Distribution of number of cylinders the engine has.

Figure 2 displays the distribution of the number of cylinders across the sample cars. According to this pie chart, the majority of the sample cars, around 77%, have four-cylinder engines. This is followed by six-cylinder cars at 20% and eight-cylinder cars at 2%. Number of cylinders is chosen because it is a categorical variable.
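The percentages behind a pie chart like this one can be computed with “value_counts”. The sample below is hypothetical and does not reproduce the shares reported above.

```python
import pandas as pd

# Hypothetical sample of ten cars, not the real dataset.
cylinders = pd.Series(
    ["four"] * 7 + ["six"] * 2 + ["eight"], name="Num-of-cylinders"
)

# normalize=True returns proportions; multiply by 100 for percentages.
shares = cylinders.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

Assuming matplotlib is installed, “shares.plot.pie()” would draw the corresponding chart.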

Column with numerical values

Figure 3: Density of Miles Per Gallon for City-drive.

Figure 3 illustrates the distance that a car can travel per gallon of fuel in city driving. In other words, this density plot represents the cars' fuel efficiency, and in this case it has a bimodal distribution. According to Figure 3, city miles per gallon ranges from 13 to 49, with an average of 24. Miles per gallon is chosen here because it is a numerical variable.

Correlation between the engine-size and price

Figure 4: Correlation between engine-size and price

In Figure 4, I have used a scatterplot to display the association between engine-size and price. I have also added a straight line, called the regression line, to this scatterplot. The regression line in Figure 4 indicates a positive linear association between engine-size and car price.
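A regression line of this kind is a least-squares fit of price against engine-size. The sketch below computes the slope and intercept with NumPy on hypothetical values; the report does not state which library drew the line in Figure 4.

```python
import numpy as np

# Hypothetical engine sizes and prices, not taken from the dataset.
engine_size = np.array([90.0, 110.0, 130.0, 150.0, 200.0])
price = np.array([6200.0, 8800.0, 12500.0, 14700.0, 22800.0])

# A degree-1 polyfit returns the least-squares slope and intercept.
slope, intercept = np.polyfit(engine_size, price, 1)

# Fitted prices along the regression line.
fitted = slope * engine_size + intercept
print(slope, intercept)
```

A positive slope corresponds to the positive association described above.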


Correlation between the drive-wheels and price

Figure 5: Correlation between drive-wheels and price

Figure 5 shows the association between drive-wheels and price. I have used a boxplot to compare prices across the three drive-wheels categories: rear-wheel, front-wheel and four-wheel drive. Figure 5 illustrates that rear-wheel drive vehicles are, on average, the most expensive cars in our sample, whereas four-wheel and front-wheel drive cars have approximately similar prices.

Correlation between the horsepower and price

Figure 6: Correlation between horsepower and price

Figure 6 illustrates the association between horsepower and price. The regression line in Figure 6 indicates a strong, positive, linear association between horsepower and price: cars with higher horsepower tend to have higher prices.
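The strength of a linear association can be quantified with the Pearson correlation coefficient. The report does not give this figure, so the sketch below only illustrates the calculation on hypothetical values.

```python
import pandas as pd

# Hypothetical horsepower and price values, not taken from the dataset.
Automobile = pd.DataFrame({
    "Horsepower": [70, 95, 110, 150, 200],
    "Price": [6500, 9000, 12000, 17000, 30000],
})

# Series.corr uses the Pearson method by default; values near +1
# indicate a strong positive linear association.
r = Automobile["Horsepower"].corr(Automobile["Price"])
print(r)
```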


Scatter matrix for all numerical columns

Figure 7: Scatter matrix for all numerical columns

Figure 7 provides a scatter matrix for the numerical variables within the dataset.

References

Mediasittich. (2019). Linear regression for car price prediction. Retrieved from https://www.kaggle.com/mediasittich/linear-regression-for-car-price-prediction

Automobile Data Set. (n.d.). Retrieved from https://archive.ics.uci.edu/ml/datasets/Automobile

Shubhamsinghgharsele. (2018). Exploratory Data Analysis on Automobile Dataset. Retrieved from https://www.kaggle.com/shubhamsinghgharsele/exploratorydata-analysis-onautomobile-dataset
