2 2 data cleansing UKY tutorial explanation for how to use JMP PDF

Title	2 2 data cleansing UKY tutorial explanation for how to use JMP
Author	Bill bob
Course	Analyzing Business Operations
Institution	University of Kentucky
Pages	5
File Size	558.2 KB
File Type	PDF
Total Downloads	39
Total Views	133

Summary

UKY tutorial explanation for how to use JMP...

Description

Task 2 – Data Cleansing

Answer the following questions to cleanse the data in “COVID Mortality.csv” acquired from Task 1:

1. What anomalies are detected and corrected during data cleansing?
2. Should all the anomalies identified above be cleansed in this case? Explain.
3. How do you perform the data cleansing that is needed in this case?

Data cleansing is important to ensure the data acquired is accurate, complete, and unbiased by detecting and correcting four typical anomalies: duplication, errors, missing values, and outliers.

To protect a patient’s privacy, you do not have identifiable information about each row of data. As such, data duplication simply means patients who died from COVID had many risk factors in common. Therefore, duplication is not a concern in this case. In other words, you only need to cleanse the data from (1) outliers, (2) missing values, and (3) errors.

To ensure that you cleanse data that is relevant for achieving the goal of characterizing demographic risk factors of COVID mortality, you focus on the following five columns of interest, as noted in Task 1:

• cdc_case_earliest_dt: serves as the basis to group data by day/week/month
• sex: serves as one of the demographic risk factors
• age_group: serves as one of the demographic risk factors
• race_ethnicity_combined: serves as one of the demographic risk factors
• death_yn: serves as the COVID mortality indicator

The procedure to cleanse the data from (1) outliers, (2) missing values, and (3) errors in JMP is described step-by-step as follows:

Step 1 – retrieve the data acquired from Task 1

Open the “COVID Mortality.csv” file downloaded in Task 1 with JMP:

• File > Open
• “COVID Mortality.csv” > Open (make sure the box “Use default program to open” is unchecked)

OR

• Right-click “COVID Mortality.csv” > Open with > JMP
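For readers who want to cross-check the JMP workflow outside JMP, the retrieval step can be sketched in Python using only the standard library. This is an illustrative sketch, not part of the tutorial: the helper name `load_columns` and the two sample rows (including their date format and category labels) are hypothetical stand-ins for the real file.

```python
import csv
import io

# The five columns of interest named in the tutorial.
COLUMNS_OF_INTEREST = [
    "cdc_case_earliest_dt",
    "sex",
    "age_group",
    "race_ethnicity_combined",
    "death_yn",
]


def load_columns(handle, columns=COLUMNS_OF_INTEREST):
    """Read CSV rows from an open text handle, keeping only `columns`."""
    reader = csv.DictReader(handle)
    absent = [c for c in columns if c not in (reader.fieldnames or [])]
    if absent:
        raise KeyError(f"columns not found in file: {absent}")
    return [{c: row[c] for c in columns} for row in reader]


# Hypothetical two-row sample standing in for "COVID Mortality.csv".
SAMPLE = io.StringIO(
    "cdc_case_earliest_dt,sex,age_group,race_ethnicity_combined,death_yn,extra\n"
    '2020/03/15,Male,70 - 79 Years,"White, Non-Hispanic",Yes,x\n'
    '2020/04/02,Female,80+ Years,"Black, Non-Hispanic",Yes,y\n'
)
rows = load_columns(SAMPLE)
```

With the real file, the same helper would be called as `with open("COVID Mortality.csv", newline="") as f: rows = load_columns(f)`.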

Step 2 – cleanse the data from outliers

You will use the Quantile Range Outliers method in JMP to detect outliers for numeric data. Since cdc_case_earliest_dt is the only numeric column among the five columns of interest, you will perform outlier detection for the “cdc_case_earliest_dt” column only as follows:

• From the COVID Mortality window, select Analyze > Screening > Explore Outliers.

• Select “cdc_case_earliest_dt” and drop it to [Y, Columns], then click [OK].

• Select [Quantile Range Outliers] under Commands of Explore Outliers.

• The resulting Quantile Range Outliers report indicates that there are no outliers in “cdc_case_earliest_dt”.
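The idea behind a quantile range check can be sketched in Python. A value is flagged when it lies more than Q times the interquantile range below the lower tail quantile or above the upper tail quantile. The settings used here (tail quantile 0.1, Q = 3) are assumptions meant to mirror typical defaults in the Explore Outliers dialog; confirm them in your JMP version. The quantile interpolation below may also differ slightly from JMP's, and the sample dates are hypothetical.

```python
from datetime import date


def quantile(sorted_values, p):
    """Linearly interpolated quantile of an already sorted list (0 <= p <= 1)."""
    if not sorted_values:
        raise ValueError("empty data")
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def quantile_range_outliers(values, tail=0.1, q=3.0):
    """Return the values lying more than q * (interquantile range) past a tail quantile."""
    ordered = sorted(values)
    low = quantile(ordered, tail)
    high = quantile(ordered, 1 - tail)
    spread = high - low
    lower_fence = low - q * spread
    upper_fence = high + q * spread
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical dates: twenty days in March 2020 plus one obviously wrong year.
dates = [date(2020, 3, d) for d in range(1, 21)] + [date(1920, 1, 1)]

# Dates are converted to day ordinals so they can be treated as numbers.
ordinals = [d.toordinal() for d in dates]
outliers = [date.fromordinal(o) for o in quantile_range_outliers(ordinals)]
```

In this made-up sample only the 1920 date falls outside the fences. A column with no flagged values corresponds to the "no outliers" result reported for “cdc_case_earliest_dt”.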

Step 3 – cleanse the data from missing and erroneous values

JMP provides the Columns Viewer, found in the Cols menu, to detect missing and erroneous values in a column. Since there are five columns of interest (i.e., “cdc_case_earliest_dt”, “sex”, “age_group”, “race_ethnicity_combined”, and “death_yn”), you detect missing and erroneous values in all five columns by performing the following:

• From the COVID Mortality window, select Cols > Columns Viewer.

• Select the five columns simultaneously (hold down the [Ctrl]/[Cmd] key), then click [Show Summary].

• Click the [Distribution] button under Summary Statistics to see the distribution of the five variables.

Note that there are no missing or erroneous values in the columns of interest.

Step 4 – save your work

• Save the clean data to “COVID Mortality” as a JMP file (File > Save).
• Close all windows and exit JMP.
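The missing and erroneous value check from Step 3 can also be approximated in Python as a rough cross-check. This is a sketch under stated assumptions: the helper `summarize_column`, the set of missing markers, the allowed category labels, and the sample rows are all hypothetical and are not taken from the tutorial or the data file.

```python
from collections import Counter

# Hypothetical markers treated as "missing"; adjust to the actual file.
MISSING_MARKERS = {"", "NA"}


def summarize_column(rows, column, allowed=None):
    """Count missing values and list values outside an optional allowed set."""
    values = [(row.get(column) or "").strip() for row in rows]
    counts = Counter(values)
    n_missing = sum(n for v, n in counts.items() if v in MISSING_MARKERS)
    unexpected = sorted(
        v
        for v in counts
        if allowed is not None and v not in allowed and v not in MISSING_MARKERS
    )
    return {
        "n": len(values),
        "n_missing": n_missing,
        "unexpected": unexpected,
        "counts": dict(counts),
    }


# Hypothetical rows with one blank and one misspelled category.
sample_rows = [
    {"sex": "Male", "death_yn": "Yes"},
    {"sex": "Female", "death_yn": "Yes"},
    {"sex": "", "death_yn": "Yes"},
    {"sex": "Femle", "death_yn": "Yes"},
]
report = summarize_column(sample_rows, "sex", allowed={"Male", "Female", "Other"})
```

A clean column, like those reported in Step 3, would yield `n_missing` equal to 0 and an empty `unexpected` list.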
