UCS Finalized PDF

Title UCS Finalized
Author Smith Robert
Course banking
Institution Universiti Teknologi MARA
Pages 28
File Size 1.4 MB
File Type PDF

Summary

Revision...


Description

1.0 INTRODUCTION

RapidMiner is a data science software platform, developed by the company of the same name, that provides an integrated environment for data preparation, machine learning, deep learning, text mining, and predictive analytics. It is used for business and commercial applications as well as for research, education, training, rapid prototyping, and application development, and it supports all steps of the machine learning process, including data preparation, results visualization, model validation, and optimization. Data scientists also use RapidMiner to build predictive analytics. According to Bloor Research, RapidMiner provides 99% of an advanced analytical solution through template-based frameworks that speed delivery and reduce errors by nearly eliminating the need to write code. RapidMiner unifies the entire data science lifecycle, from data preparation to machine learning to predictive model deployment.

RapidMiner has several advantages. It offers numerous procedures, especially for attribute selection and outlier detection. It has full facilities for model evaluation using cross-validation and independent validation sets. It also offers over 1,500 methods for data integration, data transformation, analysis, and modelling, as well as visualization. However, RapidMiner is limited in its ability to partition a dataset into training and testing sets. The tool is best suited to people who are accustomed to working with database files, for example in academic or business settings, because the software requires the ability to work with SQL statements and files.

RapidMiner Studio provides an intuitive GUI client that enables users to design code-free analysis processes. It is a tool created for data mining, built on the idea that the analyst does not need good programming skills. To make the data mining process more transparent and smooth, it has a good set of predefined operators that solve a wide range of problems. It helps users explore, blend, and cleanse data more easily, as well as build and validate models. It also gives users access to a comprehensive list of data sources, data transformations, and visualization methods.

2.0 DATA PREPARATION

Data preparation is the process of cleaning and transforming raw data prior to processing and analysis. It is an important step and often involves reformatting data, making corrections to data, and combining data sets to enrich data. Our data preparation includes selecting and understanding the dataset and cleaning the data.

2.1 Dataset

A dataset is a collection of data. It corresponds to one or more database tables, where every column of a table represents a particular attribute and each row corresponds to a given record of the dataset in question. We use the Titanic dataset, which can be found in the sample data repository of RapidMiner Studio. This dataset consists of twelve attributes: passenger class, name, sex, age, number of siblings or spouses on board, number of parents or children on board, ticket number, cabin, passenger fare, port of embarkation, life boat, and survived.

2.2 Data cleaning and transformation

Data cleaning is the process of detecting and correcting or removing corrupt or inaccurate records from a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the unwanted records. Data transformation, in turn, reorders the data into the desired order and divides it into groups, for example by size or frequency (binning).

Figure 2.2: Process of data cleaning and transformation

Figure 2.2 shows the process of cleaning and transforming the data, leaving only the desired attributes, in the simplest order, for further evaluation. First, the desired attributes are chosen by applying the Select Attributes operator. Second, examples with missing values in the chosen attributes are removed by applying the Filter Examples operator. Next, the survived and passenger fare attributes are given the target role using the Set Role operator. We then use the Split Data operator to partition the data into training and test sets. As the age attribute is not in order, we use the Sort operator to rearrange it in increasing order. Lastly, to divide the numerical age attribute into groups, we use the Discretize operator.

Table 2.2: List of Operators

Operator: Select Attributes
Description: This operator selects a favourable subset of attributes of an example set to explore and removes the other attributes. In this case, we select age, sex, passenger class, survived, and passenger fare out of all the attributes in the Titanic dataset.

Operator: Filter Examples
Description: This operator selects which examples of an example set are kept and which examples are removed. From the selected attributes of the Titanic dataset, we removed all the missing values by selecting the ‘no missing attributes’ parameter.

Operator: Set Role
Description: This operator is used to change the role of one or more attributes. We set survived and passenger fare as the target attributes for learning operators.

Operator: Sort
Description: This operator sorts the input example set in ascending or descending order according to a single attribute. We arrange the age attribute in ascending order for ease of exploring the data later.

Operator: Discretize
Description: This operator discretizes the selected numerical attributes into specified classes. The selected numerical attributes are changed to nominal attributes. We discretized the age attribute into four classes: children (0-12), teenagers (13-20), adults (21-60), and senior citizens (60-100).
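For readers who prefer code to the operator view, the select, filter, sort, and discretize steps can be sketched in plain Python. This is only an illustrative sketch, not the RapidMiner process itself: the column names, the `prepare` and `age_group` helpers, and the three sample rows are all hypothetical, and because the report's adult and senior classes overlap at age 60, the sketch assumes that age 60 counts as an adult.

```python
from bisect import bisect_left

# Hypothetical column names; the RapidMiner sample may label them differently.
KEEP = ("age", "sex", "passenger_class", "survived", "fare")

# Upper bounds of the first three age classes (assumption: age 60 is an adult).
AGE_UPPER_BOUNDS = (12, 20, 60)
AGE_LABELS = ("child", "teenager", "adult", "senior citizen")


def age_group(age):
    """Map a numeric age to one of the four nominal classes."""
    return AGE_LABELS[bisect_left(AGE_UPPER_BOUNDS, age)]


def prepare(rows):
    """Mimic Select Attributes, Filter Examples, Sort, and Discretize."""
    # Select Attributes: keep only the five attributes of interest.
    selected = [{key: row.get(key) for key in KEEP} for row in rows]
    # Filter Examples: drop every example that has a missing value.
    complete = [r for r in selected if all(v is not None for v in r.values())]
    # Sort: ascending order of age.
    complete.sort(key=lambda r: r["age"])
    # Discretize: add the nominal age class next to the numeric age.
    return [{**r, "age_group": age_group(r["age"])} for r in complete]


# Made-up rows for illustration only, not taken from the Titanic data.
sample_rows = [
    {"name": "A", "age": 30, "sex": "male", "passenger_class": "First",
     "survived": "No", "fare": 50.0},
    {"name": "B", "age": None, "sex": "female", "passenger_class": "Third",
     "survived": "Yes", "fare": 8.05},
    {"name": "C", "age": 8, "sex": "female", "passenger_class": "Second",
     "survived": "Yes", "fare": 26.0},
]
prepared = prepare(sample_rows)
```

In this sketch the row with the missing age is dropped, the unused name attribute disappears, and the remaining rows come back ordered by age with an extra nominal `age_group` value.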

2.3 Data selection

Data selection is the process of determining the appropriate data source and selecting the most suitable attributes to explore. The main concern is to exclude data that does not support the analysis. For example, in the Titanic dataset we did not use cabin, name, ticket number, and life boat for data exploration. Data selection is done with the Select Attributes operator provided in RapidMiner.

Table 2.3: Attribute Description

Attribute: Survived
Description: The Titanic struck an iceberg on April 14, 1912, and sank off the coast of Newfoundland in the North Atlantic Ocean.
Statistic: Based on the data distribution, there were 732 people aboard, of whom 299 survived and 433 did not survive the incident.

Attribute: Passenger fare
Description: A fare is the payment received by an individual or company when someone uses their transportation.
Statistic: Based on the data distribution, the highest fare that has been paid is $582, followed by $92, $27, $3, $14, and the lowest is $8.

Attribute: Sex
Description: There are two categories of gender, female and male.
Statistic: Based on the tree map, more male than female passengers travelled on the Titanic: 465 males and 267 females.

Attribute: Passenger class
Description: Passenger class is the quality of accommodation on public transport. The accommodation could be a seat or a cabin, for example. Higher travel classes are designed to be more comfortable and are typically more expensive.
Statistic: From the data distribution, the Titanic had three classes: 200 passengers in first class, 183 passengers in second class, and 349 passengers in third class.

Attribute: Age
Description: Based on our findings about the ages of the passengers who survived, age is divided into four groups: children (1-12), teenagers (13-20), adults (21-60), and senior citizens (60 and above).
Statistic: These four groups give different results in our findings.

3.0 DATA EXPLORATION

3.1 Survival

Table 3.1: Number of Surviving Passengers

Index  Nominal Value  Absolute Count
1      No             433
2      Yes            299

Table 3.1 shows the number of passengers who survived the Titanic incident. The Titanic struck an iceberg, and its distress signal did not reach the nearby ship Californian. The Titanic finally sank, and many passengers struggled to survive. According to Table 3.1, 299 passengers survived and 433 did not.

3.1.1 Comparison of survival based on age

Table 3.2: Fractional table for age

Index  Age             Absolute count  No. of survivors  Fraction (%)  No. of non-survivors  Fraction (%)
1      Child           74              44                59.46         30                    40.54
2      Teenager        105             37                35.24         68                    64.76
3      Adult           530             212               40            318                   60
4      Senior Citizen  23              6                 26.09         17                    73.91

Figure 3.1: Comparison of survivor based on age bar chart

Figure 3.1 illustrates the number of people who survived in each age category. Age is divided into four categories: child, teenager, adult, and senior citizen. "Yes" and "No" represent the number of people who survived and did not survive. The graph shows that, overall, the majority of passengers did not survive. It also shows that the majority of people on board were adults, aged 21-60, and that the oldest passengers fall in the senior citizen category, aged 60 and above. The results reveal that children have the highest percentage of survival of all the age categories, while senior citizens have the lowest. Of the 74 children aboard the Titanic, 44 (59.46%) survived. Of the 105 teenagers on board, 37 (35.24%) survived. Of the 530 adults on the ship, 212 (40%) survived. Lastly, of the 23 senior citizens aboard, 6 (26.09%) survived. This suggests that children were given priority when passengers were saved.
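The fractions quoted above can be rechecked from the counts in Table 3.2. The following is a minimal sketch in plain Python; the `survival_table` helper and the variable names are illustrative and are not part of the RapidMiner process.

```python
# Counts from Table 3.2: age group -> (absolute count, number of survivors)
AGE_COUNTS = {
    "Child": (74, 44),
    "Teenager": (105, 37),
    "Adult": (530, 212),
    "Senior Citizen": (23, 6),
}


def survival_table(counts):
    """Return {group: (survived %, not survived %)}, rounded to two decimals."""
    table = {}
    for group, (total, survived) in counts.items():
        survived_pct = round(100 * survived / total, 2)
        not_survived_pct = round(100 * (total - survived) / total, 2)
        table[group] = (survived_pct, not_survived_pct)
    return table


age_fractions = survival_table(AGE_COUNTS)

# Overall survival across all four groups.
total_passengers = sum(total for total, _ in AGE_COUNTS.values())
total_survivors = sum(survived for _, survived in AGE_COUNTS.values())
overall_pct = round(100 * total_survivors / total_passengers, 2)
```

The four groups add up to 732 passengers and 299 survivors, which agrees with Table 3.1.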

3.1.2 Comparison of survival based on sex

Table 3.3: Fractional table of Survival based on Age

Index  Age             Absolute count  No. of survivors  Fraction (%)  No. of non-survivors  Fraction (%)
1      Child           74              44                59.46         30                    40.54
2      Teenager        105             37                35.24         68                    64.76
3      Adult           530             212               40            318                   60
4      Senior Citizen  23              6                 26.09         17                    73.91

Figure 3.2: Bar chart of Survival based on Age

Figure 3.2 above shows the comparison of the number of survivors based on gender. The overall sample is 732 passengers, consisting of 465 males and 267 females. Of these, 40.85% survived and 59.15% did not. The figure shows that 74.15% of females survived, compared to 24.72% of males. This indicates that females were more likely to survive the Titanic incident. Several factors may explain this. The first is gender: it is human nature for men to prioritise women during an emergency. The other is relationships, either family ties or affection between passengers, since passengers travelled on the Titanic with their loved ones or with their families. Thus, females have the highest survival rate.

3.1.3 Comparison of survival based on passenger class

Table 3.4: Fractional table for passenger class

Index  Passenger Class  Absolute count  No. of survivors  Fraction (%)  No. of non-survivors  Fraction (%)
1      First            200             124               62            76                    38
2      Second           183             83                45.36         100                   54.64
3      Third            349             92                26.36         257                   73.64
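A table like Table 3.4 is a cross-tabulation of passenger class against the survived attribute. As a rough illustration of how such counts arise from example-level records, the sketch below expands the counts of Table 3.4 back into one record per passenger and tabulates them again with `collections.Counter`. The key names and the `crosstab` and `survival_rates` helpers are hypothetical and are not RapidMiner functions.

```python
from collections import Counter

# Counts from Table 3.4: (passenger class, survived, number of passengers)
TABLE_3_4 = [
    ("First", "Yes", 124), ("First", "No", 76),
    ("Second", "Yes", 83), ("Second", "No", 100),
    ("Third", "Yes", 92), ("Third", "No", 257),
]

# Expand the counts into one record per passenger, as an example set holds them.
records = [
    {"passenger_class": cls, "survived": outcome}
    for cls, outcome, n in TABLE_3_4
    for _ in range(n)
]


def crosstab(rows, group_key, outcome_key="survived"):
    """Count the examples for every (group, outcome) pair."""
    return Counter((row[group_key], row[outcome_key]) for row in rows)


def survival_rates(table):
    """Percentage of 'Yes' outcomes within each group, to two decimals."""
    totals = Counter()
    for (group, _outcome), n in table.items():
        totals[group] += n
    return {g: round(100 * table[(g, "Yes")] / totals[g], 2) for g in totals}


class_table = crosstab(records, "passenger_class")
class_rates = survival_rates(class_table)
```

Passing a different `group_key`, such as a sex or age-group attribute, would give the corresponding table for that attribute, provided the records carry it.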

Figure 3.3: Comparison of survivor based on passenger class bar chart

The classification refers to the social class of every passenger. Figure 3.3 above shows the comparison of survival based on passenger class. There are three passenger classes: first, second, and third. Each class is divided into two groups, survived and non-survived. The total sample is 732 passengers, of whom 40.85% survived and 59.15% did not. The bar chart shows that the highest number of survivors came from first class, followed by third and second class. Meanwhile, the highest number of passengers who did not survive came from third class, followed by second and first class. Based on the number of survivors in the bar chart, first class passengers had the highest survival on the Titanic. This indicates that social class, as reflected in passenger class, has a positive relationship with survival. Comparing the classes by percentage of survivors, first class has the highest percentage (62%), second class comes second (45.36%), and third class comes last (26.36%). This indicates that during the emergency, priority was given according to social class, as the survival percentages show.

3.2 Passenger Fare

Passenger fare is the fee paid by a passenger for the use of a public transport system. A fare structure is the system set up to determine how much is to be paid by various passengers using a transit vehicle at any given time, for instance in road, sea, and air transport. In this study, the Titanic data includes the passenger fare attribute, which lets us analyse fares by the passenger's age, gender, and passenger class. After examining the sample Titanic data using RapidMiner, we found that the lowest fare on the Titanic is $0, while the highest fare is $512.329.

3.2.1 Relationship of passenger fare in terms of age

Table 3.5: Passenger fare in terms of age

Index  Age             Maximum ($)  Upper quartile ($)  Median ($)  Lower quartile ($)  Minimum ($)
1      Child           151.55       31.39               25.08       15.75               3.17
2      Teenager        263          21.63               9.84        7.85                4.01
3      Adult           512.33       40.69               16.05       8.05                0
4      Senior Citizen  263          75.25               26.55       9.69                6.24

Figure 3.4: Passenger fare in terms of age using boxplot chart

Figure 3.4 shows the passenger fare by age group as a boxplot. Age is divided into four categories: children (0-12 years old), teenagers (13-20 years old), adults (21-60 years old), and senior citizens (60 and above). The figure reports the maximum, upper quartile, median, lower quartile, and minimum of the passenger fare for each age group. The shape of each box also shows how the fares paid differ between groups. Through this analysis in RapidMiner, we can therefore see the trend of passenger fare.

For children (0-12 years old), the boxplot shows a maximum of $151.5 and a minimum of $3.2. The upper and lower quartiles are $32.6 and $18.8, so the box sits toward the bottom of the fare range. The median passenger fare in this group is $28, which marks the centre of the fares paid for child passengers.

For teenagers (13-20 years old), the maximum passenger fare is $263 and the minimum is $4. The upper and lower quartiles are $26.1 and $7.9 respectively; this affects the shape of the box, since these are the lowest quartiles among the categories. The median, which marks the centre of the teenagers' fares, is $10.5.

For senior citizens (60 and above), the maximum is $263, the same as for teenagers, and the minimum is $6.2. Senior citizens have the largest box of all the categories, since the upper quartile is the highest at $79.5 while the lower quartile is $10.96. The median for this category is $32.9, which lies between the lower and upper quartiles.

For adults (21-60 years old), the boxplot records the highest maximum and the lowest minimum of all the categories, $512 and $0 respectively. The box sits toward the bottom of the fare range, with lower and upper quartiles of $8.1 and $41.1. The median, which marks the centre of the adult fares, is $15.9.
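The five values that each boxplot reports can also be computed directly from a list of fares. The sketch below uses `statistics.quantiles` from the Python standard library and is only an illustration: the `five_number_summary` helper is hypothetical, the example fares are made up, and RapidMiner may use a different quartile convention, so its values need not match this method exactly.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return the five statistics that a boxplot draws."""
    data = sorted(values)
    # method="inclusive" interpolates between the observed values;
    # other tools may follow a different quartile convention.
    lower_quartile, median, upper_quartile = quantiles(
        data, n=4, method="inclusive"
    )
    return {
        "minimum": data[0],
        "lower_quartile": lower_quartile,
        "median": median,
        "upper_quartile": upper_quartile,
        "maximum": data[-1],
    }


# Made-up fares for illustration only, not taken from the Titanic data.
example_fares = [0, 10, 20, 30, 40]
summary = five_number_summary(example_fares)
```

Applying the helper to the fares of one age group at a time would give one row of a table like Table 3.5 per group.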

3.2.2 Relationship of passenger fare in terms of sex

Table 3.6: Passenger fare in terms of sex

Index  Sex     Maximum ($)  Upper quartile ($)  Median ($)  Lower quartile ($)  Minimum ($)
1      Male    512.33       29.7                13          7.90                0
2      Female  263          59.4                26          13                  6.75

Figure 3.5: Passenger fare in terms of sex using boxplot chart

Table 3.6 above shows the maximum, upper quartile, median, lower quartile, and minimum of the passenger fare by sex. The values in Table 3.6 are taken from the boxplot in Figure 3.5, which shows the fares paid by passengers of each sex and lets us see the trend of passenger fare.

The median of the female boxplot is $26, which means the centre of the fares paid by female passengers is $26. The maximum and minimum fares are $263 and $6.75 respectively. The female boxplot is skewed to the bottom, since the middle half of the fares lies between $13 and $59.4. The female box is bigger than the male box because the range between its upper and lower quartiles is wider.

The median male passenger fare is $13, which means the centre of the fares paid by male passengers is $13. The maximum fare is $512.33 and the minimum is $0. The male boxplot is also skewed to the bottom, with upper and lower quartiles of $29.7 and $7.90. The male box is smaller because the range between its upper and lower quartiles is small.
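The comparison of box size and skew can be expressed in numbers using the quartiles from Table 3.6. The sketch below computes the interquartile range (the height of the box) and Bowley's quartile skewness coefficient. The coefficient is not used in the original analysis; it is brought in here only as one common way to quantify the description "skewed to the bottom", and the `box_shape` helper name is hypothetical.

```python
# Quartiles from Table 3.6: sex -> (lower quartile, median, upper quartile)
FARE_QUARTILES = {
    "Male": (7.90, 13, 29.7),
    "Female": (13, 26, 59.4),
}


def box_shape(lower_quartile, median, upper_quartile):
    """Return (interquartile range, Bowley quartile skewness) for one box."""
    iqr = upper_quartile - lower_quartile
    # Positive when the median sits nearer the lower quartile, so the
    # upper part of the box is stretched towards high fares.
    skewness = (upper_quartile + lower_quartile - 2 * median) / iqr
    return round(iqr, 2), round(skewness, 3)


box_stats = {sex: box_shape(*q) for sex, q in FARE_QUARTILES.items()}
```

With these inputs the female interquartile range (46.4) is more than twice the male one (21.8), and both coefficients are positive, which is consistent with boxes whose medians sit closer to the lower quartile.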

3.2.3 Relationship of passenger fare in terms of passenger class

Figure 3.6: Passenger fare in terms of passenger class using boxplot chart


Table 3.7: Passenger fare in terms of passenger class

Index  Passenger Class  Maximum ($)  Upper quartile ($)  Median ($)  Lower quartile ($)  Minimum ($)
1      First            512.33       108.9               69.3        33.63               0
2      Second           73.5         26                  18.75       13                  9.69
3      Third            69.55        15.25               8.05        7.76                0

Table 3.7 above shows the maximum, upper quartile, median, lower quartile, and minimum of the passenger fare by passenger class. The values in Table 3.7 are taken from the boxplot in Figure 3.6, which shows the fares paid by passengers of each class and lets us see the trend of passenger fare.

Based on Figure 3.6, the median first class passenger fare is $69.3, which means the centre of the fares paid by first class passengers is $69.3. The maximum of the fare...

