Titanic PDF

Title	Titanic
Author	Shuja ur Rehman
Course	Introduction to Data Science
Institution	COMSATS University Islamabad
Pages	7
File Size	241.7 KB
File Type	PDF
Total Downloads	56
Total Views	137

Summary

Titanic dataset analysis...

Description

Titanic Disaster

Introduction

The dataset chosen for this assignment is the Titanic dataset from Kaggle. The Titanic disaster was one of the biggest disasters of its type. It happened on the ship's maiden voyage. The main purpose of this task is to understand why some passengers survived while most did not make it to land after the incident. There were 2224 passengers and crew members on the ship; only 722 of them survived, and 1502 were killed. This analysis identifies characteristics of the passengers, such as cabin, age, and point of departure, and examines the relationship between those characteristics and the chance of survival. Although some element of luck was involved in the survival of those 722 people, it is assumed that some groups of people on the ship were more likely to survive than others. This task also builds a predictive model to answer the question of which groups of people had a better chance of surviving this major ship accident.
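The figures quoted in the introduction can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the two totals stated in the text; the variable names are illustrative.

```python
# Totals quoted in the introduction.
total_on_board = 2224
deaths = 1502

# Survivors are whoever is left after subtracting the deaths.
survivors = total_on_board - deaths
survival_rate = survivors / total_on_board

print(f"Survivors: {survivors}")
print(f"Overall survival rate: {survival_rate:.1%}")
```

The overall rate is a useful baseline: any predictive model should do better than always guessing the majority outcome.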

Datasets

Two data sets are used in this analysis: the train data and the test data. The train data holds the details of 891 passengers and contains many attributes, such as age and type of cabin. Most importantly, it has the survival attribute, which records whether the passenger survived. This is the output attribute of the dataset. The test dataset has all the attributes of the train dataset except the survival attribute; the value of that attribute is predicted for the test dataset using the model built in the training phase, which is how the model and its prediction accuracy are checked. The training file contains 892 rows and 12 columns: the first row is the header row holding the attribute names, and the remaining 891 rows describe one passenger each. The first column is the Passenger ID, which is unique for every passenger, and the second column records whether the passenger survived. The P Class attribute gives the class of the passenger's ticket. The whole dataset will be divided by ratio into test and train data. Many algorithms will be applied for research purposes, along with other methods.
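The layout described above can be inspected with pandas. The real Train.csv is not reproduced here, so this sketch reads a tiny made-up sample in the 12-column layout of the Kaggle training file. The sample rows are invented, the column names beyond Passenger ID, survival, and class are taken from the Kaggle file rather than from this report, and the 80/20 split is an assumption because the report does not state the ratio.

```python
import io

import pandas as pd

# Made-up rows in the 12-column layout of the Kaggle training file.
# With the real data this would be pd.read_csv("Train.csv").
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,Passenger A,male,22,1,0,T001,7.25,,S
2,1,1,Passenger B,female,38,1,0,T002,71.28,C85,C
3,1,3,Passenger C,female,26,0,0,T003,7.93,,S
4,1,1,Passenger D,female,35,1,0,T004,53.10,C123,S
5,0,3,Passenger E,male,,0,0,T005,8.05,,S
6,0,3,Passenger F,male,,0,0,T006,8.46,,Q
7,0,1,Passenger G,male,54,0,0,T007,51.86,E46,S
8,0,3,Passenger H,male,2,3,1,T008,21.08,,S
9,1,3,Passenger I,female,27,0,2,T009,11.13,,S
10,1,2,Passenger J,female,14,1,0,T010,30.07,,C
"""

train = pd.read_csv(io.StringIO(SAMPLE_CSV))

# shape is (rows, columns); pandas does not count the header line as a row.
print(train.shape)

# Hold out part of the labelled data (80/20 is an assumed ratio).
fit_part = train.sample(frac=0.8, random_state=0)
holdout_part = train.drop(fit_part.index)
print(len(fit_part), len(holdout_part))
```

Because pandas treats the header line as column names, the real training file would report 891 rows here, not 892.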

Language used

The language used for this analysis is Python, an interpreted, general-purpose, high-level language. It is widely used for machine learning, it was designed with code readability in mind, and it supports an object-oriented approach that helps programmers write clear code for projects small or large. This language helped a lot in analyzing the Titanic disaster problem and in getting to the causes of passengers' survival or non-survival.

Literature Review

Much work has been done on this topic, and Kaggle holds a competition on this dataset in which many people participate. Many writers have also written reports on this topic; one writer stated that the Titanic remains one of the biggest tragedies in recorded human history. The objective of these papers is to apply different machine learning and deep learning algorithms to examine why and under what circumstances passengers survived the Titanic disaster and why most of them did not. The attributes included in both the train and test datasets include the passenger's ID, age, class, and ticket fare. The results of all the algorithms were then analyzed to get a better view of the scenario and to better understand the prediction methodology of the model. [ CITATION Try17 \l 1033 ]

Methodology

To analyze the dataset, several libraries were first installed, including NumPy, pandas, Matplotlib, and seaborn. After installation, both dataset files, the train dataset and the test dataset saved as Train.csv and Test.csv, were read and their data was printed. The screenshot of the data in the dataset is given below:

Figure 1, screenshot of the test data after reading the dataset

After reading the data set, work continued on the train data. In the next step, the Passenger ID was appended to make the data easier to read and understand, and this attribute was then set as the index column. After that, the missing-number library (missingno) was imported to find the missing values in the data set; applied to the train data set, it shows a graph of the missing values.

The screenshot of the missing values graph in the data set is given below:

Figure 2, screenshot of the missing values as graph in the data set
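The indexing and missing-value steps can be sketched with pandas alone. The data frame below is a made-up sample, and the sketch prints the per-column counts as a table instead of drawing the chart. If the plotting library meant in the text is missingno, a call such as `missingno.matrix(train)` would draw the corresponding figure.

```python
import numpy as np
import pandas as pd

# Made-up sample; with the real data this would come from pd.read_csv("Train.csv").
train = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 4, 5, 6],
        "Survived": [0, 1, 1, 0, 0, 1],
        "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 14.0],
        "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan, "E46"],
        "Embarked": ["S", "C", "S", np.nan, "Q", "S"],
    }
)

# Use the unique passenger identifier as the row index.
train = train.set_index("PassengerId")

# Count missing values per column and rank the columns by that count.
missing_counts = train.isna().sum().sort_values(ascending=False)
missing_share = (missing_counts / len(train)).round(2)
print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```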

After that, Imputer was imported from scikit-learn and applied to the dataset. After some other calculations, the survival ratio of the passengers according to fare class was calculated and plotted as the graph given below:

Figure 3, graph of survived passengers as per the fare of the ticket
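A minimal sketch of the imputation and the per-class survival ratio follows. The `Imputer` class was removed from `sklearn.preprocessing` in scikit-learn 0.22, so the sketch uses its replacement, `SimpleImputer` from `sklearn.impute`. The report does not say which columns were imputed or with which strategy; filling `Age` with the median is an assumption, as is reading "fare class" as the `Pclass` column. The rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up sample rows, not real passenger records.
train = pd.DataFrame(
    {
        "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
        "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
        "Age": [38.0, np.nan, 54.0, 14.0, np.nan, 22.0, 2.0, 27.0],
    }
)

# Replace missing ages with the median of the observed ages (assumed strategy).
imputer = SimpleImputer(strategy="median")
train[["Age"]] = imputer.fit_transform(train[["Age"]])

# The mean of a 0/1 column is the share of ones, i.e. the survival ratio.
survival_by_class = train.groupby("Pclass")["Survived"].mean()
print(survival_by_class)
```

With seaborn installed, a bar chart of the same ratios could be drawn with `seaborn.barplot(data=train, x="Pclass", y="Survived")`.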

After that, the survival ratio of the passengers according to their sex was calculated and plotted as the graph given below:

Figure 4, survival ratio of the passengers according to their sex

This graph shows the survival ratio of the passengers by gender; its two colors represent the male and the female ratio. The graph of survival by age is given below:

Figure 5, survived and not survived graph for the passengers of the titanic
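The numbers behind the sex and age graphs can be computed with a pandas group-by. This is a sketch on made-up rows; the age band edges and labels are arbitrary choices for illustration and are not taken from the report.

```python
import pandas as pd

# Made-up sample rows, not real passenger records.
train = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
        "Sex": ["male", "female", "female", "female", "male", "male", "male", "female"],
        "Age": [22, 38, 26, 4, 35, 54, 9, 61],
    }
)

# Share of survivors within each sex.
survival_by_sex = train.groupby("Sex")["Survived"].mean()

# Group ages into bands; intervals are (0, 12], (12, 40], (40, 100].
train["AgeBand"] = pd.cut(
    train["Age"], bins=[0, 12, 40, 100], labels=["child", "adult", "older"]
)
survival_by_age = train.groupby("AgeBand", observed=True)["Survived"].mean()

print(survival_by_sex)
print(survival_by_age)
```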

After the analysis of the data, different algorithms were applied to the train data to build and train the model. Once trained, the model was tested on the test data: it predicted for each passenger whether he or she survived, and its accuracy was then calculated. The next step after testing would be deploying the model in an application phase, but this was not done in this task as it was not required.
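The train, predict, and score loop described above can be sketched as follows. The report does not name the algorithms it applied, so logistic regression is an assumption, and the data is a synthetic stand-in generated from an invented rule, not the Titanic records. Accuracy is measured on a held-out part of the labelled data, since a test file without the survival attribute cannot be scored directly.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned training data (not real passenger records).
rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame(
    {
        "Pclass": rng.integers(1, 4, size=n),    # ticket class 1, 2 or 3
        "IsFemale": rng.integers(0, 2, size=n),  # 1 = female, 0 = male
        "Age": rng.uniform(1, 70, size=n),
    }
)

# Invented rule so there is a signal to learn: women and upper classes survive more often.
score = 2.0 * data["IsFemale"] - 0.8 * (data["Pclass"] - 2) - 0.5
data["Survived"] = (rng.random(n) < 1 / (1 + np.exp(-score))).astype(int)

X = data[["Pclass", "IsFemale", "Age"]]
y = data["Survived"]
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_fit, y_fit)

# Accuracy is the share of held-out passengers whose outcome is predicted correctly.
accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"Validation accuracy: {accuracy:.2f}")

# Predict for one hypothetical passenger who is not in the data.
new_passenger = pd.DataFrame({"Pclass": [1], "IsFemale": [1], "Age": [30.0]})
prediction = model.predict(new_passenger)
print(prediction)
```

Swapping in another classifier, for example a decision tree or random forest, only changes the line that constructs `model`; the split and the scoring stay the same.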

Decision making

Using this data set we can predict which passengers of the Titanic disaster did not survive. Using machine learning techniques in Python, we can train the model with different machine learning and deep learning algorithms, predict survival or non-survival, and identify factors such as age, sex, cabin, and class that helped passengers survive. Using this data set and the trained model, we can also predict the survival outcome of other passengers who are not in this dataset, given their data.

Conclusion and Recommendations

The Titanic was one of the biggest tragedies in human history. The incident raised many questions for researchers: why so many passengers were not able to survive, why the lifeboats were not enough for all the passengers and crew members, and what determined survival. Was passenger class an issue, creating a gap in which upper-class passengers got to the lifeboats while lower-class passengers did not, or did gender or age decide survival? This analysis was done to answer these questions, and it shows the factors behind passengers surviving or not surviving. It was also done to predict the survival of passengers according to their attributes. The dataset was good, but it could be improved by adding information on the availability of and access to lifeboats: which types of passengers had access, which had easy access, and how many lifeboats were available.

References

Barhoom, A. M. (2019). Predicting Titanic Survivors using Artificial Neural Network.

Chatterjee, T. (2017). Prediction of Survivors in Titanic Dataset: A Comparative.

Cicoria, S. (2014). Classification of Titanic Passenger Data and Chances of Surviving the Disaster.

Ekinci, E. O. (2018). A Comparative Study on Machine Learning Techniques using Titanic Dataset.

Singh, A. (2017). Analyzing Titanic disaster using machine learning algorithms....