MAT71004 Tutorial Worksheet - Week 4 Data Cleaning - S2-2020 PDF

Title MAT71004 Tutorial Worksheet - Week 4 Data Cleaning - S2-2020
Course Introduction To Evidence For Health Science Practitioners
Institution Southern Cross University
Pages 9
File Size 259 KB
File Type PDF
Total Downloads 106
Total Views 131

Summary

Download MAT71004 Tutorial Worksheet - Week 4 Data Cleaning - S2-2020 PDF


Description

Introduction to Evidence for Health Science Practitioners

Week 4 Tutorial Data management in Excel TUTORIAL TASKS CHECKLIST ☐ ☐ ☐ ☐ ☐

Part 1: Background briefing Part 2: Save the Excel workbook Part 3: The structure of the data Part 4: Clean the data in Excel Part 5: Save your work and log off

1. Background briefing We collected some data in Week 1 that we now need to analyse. You may recall that we collected demographic data (e.g. gender, age) and some survey data (Kessler Psychological Distress Scale). In today’s tutorial, we will look at the quantitative data that was collected and start to get a sense of what it all means. The survey was hosted on an online survey platform called Qualtrics. This is a widely used platform and one that you may well use at some point in the future. It is free to use and is supported by the SCU Service desk. This gives some quality assurance of the integrity of the service, particularly with regards to the privacy and protection of the participants’ data collected. As we learned last week, there are ethical and legal requirements on all data collected for research purposes. Qualtrics systematically collects survey data and allows it to be downloaded in several formats. The survey data we collected has been downloaded into an Excel spreadsheet and the data is presented in number format. This means that all the information (including words) has been converted to numbers. We need to understand how Qualtrics has allocated numbers so that we can interpret the data correctly. This is all part of the process of data management. Microsoft Excel is a very useful program and is worth getting to know. Whenever you need to do a table, consider doing it in Excel rather than Word, as Excel can perform all kinds of additional functions, such as sorting it into alphabetical order or adding up columns or rows. Tables can be easily copied and pasted into Word from Excel (and vice versa). Similarly, data spreadsheets can be easily uploaded to most statistical software packages from Excel, making Excel the program of choice for collecting and managing data. Data cleaning refers to checking and double checking the veracity of the data, making sure there are no mistakes and that the data is in the correct format. Mistakes can be made in manual input, even from a survey where the data was collected by a computer. It was still put in by humans, so there can be errors of judgment. We need to eyeball the data to check that it is OK at every stage of preparing the data for analysis.

Page 1 of 9

Introduction to Evidence for Health Science Practitioners

2. Saving the Excel data sheet On your computer (or preferred storage location) create a new folder within an Introduction to Evidence folder - perhaps call it Data. Go to the Tutorial Week 4 folder on Blackboard and find the file called Week 4 data. Save this as the original data and preserve a copy so you have the original file in case you wish to cross check whether you have made a mistake later, e.g. data_0.xlsx. Then navigate to this file and open it. Save it as another file, e.g. data_1.xlsx. This will be our working copy.

3. The structure of the data In Excel the data are organised into rows and columns. Each participant or ‘case’ has a row of all their data. Each column is a different variable, e.g. the first column is ID. Note the variable names are sometimes trimmed down to have no more than 7 or 8 characters. See if you can identify what each of the variables represents. You may need access to the original survey questions (a Word version of the survey is available on the Blackboard site in the Tutorial Week 4 folder). In each of the cells of the spreadsheet we have the observations. Each row contains all the observations from one person (or case) in the sample. Collectively, all these observations are called data. One observation (i.e. in one cell) is called a datum, but is usually just referred to as an observation. Data is actually a plural term, so that it is grammatically incorrect to say “this data”; rather, we say “these data” or “this datum/observation”. ID: This is unique number given to each participant in order of survey enrolment. It is an anonymous coding number, such that it contains no link to the identity of the person. From here on in, each individual participant is referred to by the ID number. As each participant has one number, by scrolling down to see the last number, we can see how many cases we have in the spreadsheet. This is called the sample size, denoted by n, for number. What is the sample size of for this data? 334 Degree: Notice how a number has been assigned to each category. Note the tab at the bottom of the Excel spreadsheet called Key. This is a ‘code book’ that you can use to understand how the non-numerical data has been reformatted as numbers. Check the codes for each degree. What number was your degree allocated to? Age: Check the data are all in number format. Do all the numbers all seem correct? Is there any data missing? Is there anything in this column that is not a number? Make a mental note that this will need to be corrected before we can work with it. Gender: Here we have four categories: females, males, diverse and prefer not to respond to this question. The numbers represent the categories. Check the Key tab to see what categories the numbers represent and fill in the table below. Code

Label

1

Female

2 3 4

Page 2 of 9

Introduction to Evidence for Health Science Practitioners

Relationship status: Participants were asked if they were in a committed relationship and were given four options to choose from. Note that Qualtrics has again used some number coding: 4, 6, 7, 8. Complete the table by entering the codes for the following categories: code

label

4

Yes No It’s complicated Prefer not to answer

Heart rate (HR): This column should only contain numbers. Run your eye down the spreadsheet - are there any invalid data in there? We will have to remove all of the text as part of our data cleaning shortly. Activities Participants were asked how many hours in the previous week they spent doing: 

A relaxing hobby (Hobbies_1)



Physical activity for health/fitness (Hobbies_2)



Physical activity for fun (Hobbies_3)

As before, notice any non-numeric data that will have to be removed during data cleaning. Kessler 10 Scale of Psychological Distress (K10) The Kessler 10 Scale of Psychological Distress (K10) (Kessler et al., 2002) is a valid and reliable measure of psychological distress. It has been used in other student populations and is routinely used by the government to monitor the risk of mental health disorders in the general population. By using a standardised scale such as this, we will be able to compare the findings from our sample with those from other studies. The K10 scale has 10 items (questions) and the participants answer these using a response scale with 5 categories ranging from 1 = none of the time to 5 = all of the time. This type of response scale is often referred to as Likert scale.

Page 3 of 9

Introduction to Evidence for Health Science Practitioners None of the time (1)

A little of the time (2)

Some of the time (3)

Most of the time (4)

All of the time (5)

On our Excel spreadsheet the participants’ responses are labelled K10_1 through to K10_10. If you did not undertake the survey, read through the questions to get a feel for the sort of information being asked and to understand what the numbers in the cells represent. Connor Davidson Resilience Scale (CD-RISC-10) The Connor Davidson Resilience Scale (CD-RISC-10) (Campbell-Sills & Stein, 2007) is a 10-item scale that measures resilience or how well one is equipped to bounce back after a stressful event, tragedy or trauma. Similar to the K10, CD-RISC-10 utilises a Likert scale with 5 response categories but this time the scoring for each question ranges from 0 = Not true at all to 4 = true nearly all the time. Not true at all (1)

Rarely true (2)

Sometimes true (3)

Often true (4)

True nearly all the time (5)

On our Excel spreadsheet the participants’ responses are labelled CD-RISC-10_1 through to CD-RISC10_10. A Word version of the survey is available on the Blackboard site in the Tutorial Week 4 folder.

4. Cleaning the data in Excel We will now systematically check each column in the spreadsheet as part of the data cleaning process. Using the Sort function in Excel makes this more efficient as missing or non-numeric data is moved to the bottom of the column, making these cells easier to identify. Using the Sort function in Excel

Page 4 of 9

Introduction to Evidence for Health Science Practitioners Degree: We need to check that all the numbers in this column are valid. 1.

Sort column from smallest to largest using the Sort & Filter function from the Editing group in the Home tab.

2.

What is the smallest value?

3.

What is the largest value?

4.

Check the Key, are these values within the valid range?

Age: Are all the data in this column numbers? Are the numbers within a plausible range? 1.

Sort this column from smallest to largest.

2.

Run your eye down the column.

3.

How many cases contain non-numeric data? We now need to make some decisions on how to deal with the invalid data. Where there are only a few cells with non-numeric data we can simply click in the cell and delete the words. But if there are many cells with non-numeric data there is a quick way to delete them all at once:  Highlight the whole column.  Click CTRL + F to bring up the Find and Replace function.  In the Replace tab, type the word you want to remove (e.g. years) in the Find line and then in the Replace line hit the space bar to denote a blank.  Click Replace All.

Reflect - why will these be a problem if we leave them as they are? 4.

Now the non-numeric data has been removed, re-sort this column from smallest to largest.

5.

What is the minimum (youngest) age?

6.

What is the maximum (oldest) age?

7.

Does this age range seem plausible? Where they are obvious mistakes, we can remove them and treat these cells as missing data.

8.

How many cases have missing data for this variable?

Page 5 of 9

Introduction to Evidence for Health Science Practitioners Gender: As we did for age, check the data by sorting it from smallest to largest. Are all the numbers in this variable valid? ☐Yes

☐No

How many cases have missing data is in this variable? Relationship status: Check the relationship variable using the same systematic process as before, sorting the data from smallest to largest. Are there any strange or unusual data in there? Is there much missing data?

☐Yes

☐No

Heart rate: Are all the data in this column numbers? Are the numbers within a plausible range? 1.

Sort this column from smallest to largest.

2.

How many cases contain non-numeric data? As with the age variable, where there are only a few cells with non-numeric data we can simply click in the cell and delete the words. But if there are many cells with non-numeric data it’s best to use the shortcut method to delete them all at once:  Highlight the whole column.  Click CTRL + F to bring up the Find and Replace function.  In the Replace tab, type the word you want to remove (e.g. bpm) in the Find line and then in the Replace line hit the space bar to denote a blank.  Click Replace All.

3.

Now the non-numeric data has been removed, re-sort this column from smallest to largest.

4.

What is the minimum (smallest) heart rate?

5.

What is the maximum (greatest) heart rate?

6.

Does this heart rate range seem plausible? Hint: For adults, a normal resting heart rate ranges between 60 and 100 beats a minute. People who are very physically active, such as athletes, may have resting heart rates around 40 beats per minute (healthdirect, 2020).

7.

Are there any remaining cases that contain non-numeric data?

8.

How many cases have missing data for this variable? Page 6 of 9

Introduction to Evidence for Health Science Practitioners Activities: The activities data are provided in the Hobbies columns. Using the same systematic process for checking and cleaning the data as you did for the other variables, assess each column for plausibility and remove any non-numeric data. Relaxing hobby

Physical activity for health/fitness

Physical activity for fun

1. What is the minimum (smallest) number of hours? 2. What is the maximum (greatest) number of hours? 3. Does this number of hours range seem plausible? 4. How many cases have missing data for this variable?

K10 items There are 10 items labelled K10_1 through to K10_10. Sort each item by smallest to largest and make sure that every item contains only numbers between 1 – 5. Keep an eye out for missing data. Did any of the K10 items contain missing data (i.e. blank cells)? Dealing with missing data in the K10 How should we deal with cells that are blank when they should contain data? There are many differing opinions on this and many articles and books have been written with detailed advice on how best to proceed in any given circumstance. Generally speaking, the way you deal with missing data will depend on why the data is missing. Missing data is generally classified as either (1) missing at random, or (2) not missing at random. If it is missing at random it is not really a problem. There could be just one cell missing every now and then and looks as though the participant accidently missed a question. They may not have pushed hard enough on the computer button or they may have skipped a question without realising it. This can be minimised by checking that the survey has been completed properly if you are using paper surveys or by computer settings that alert the participant if they haven’t answered a question. Data that is not missing at random can be a problem. If it is not a random process then there is some systematic reason that the data are missing. If this is not recognised, it could lead to systematic bias in interpreting the data. The guiding principles with how to deal with missing data should be that you need to preserve as much of the data as you can. You have an ethical obligation to the participants who freely gave their time and data so that they could help you to advance knowledge in some way. They may have missed a question or decided not to answer it for their own reasons. We shouldn’t delete all the cases with missing values because we could be introducing a bias into our data. Page 7 of 9

Introduction to Evidence for Health Science Practitioners

In the demographic data we can leave the cells blank and Excel will recognise them as missing data. However, if we have a missing data point when adding up questionnaire scores then this could affect the total score for the individual and the average score for the group. We will need to scrutinise all the data for the scales that we are intending to score and deal with missing data on a case by case basis. Calculating the average For those items that had missing data on the K10, we shall calculate the average of the remaining K10 items for that individual. If the participant had only one missing data in the K10 variables, then the average is the sum total of the other 9 questions divided by 9. For example, if the missing data is for ID# 12 on variable K10_7, we would first add all the other K10 variables up:

K10_1 + K10_2 + K10_3 + K10_4 + K10_5 + K10_6 + K10_8 + K10_9 + K10_10 = The average is the total divided by the number of variables (i.e. 9). We can use Excel to do this calculation for us by following these steps: 1. Click inside a cell 2. Type in an equals sign = 3. Type the word ‘sum’ followed by an open bracket sign ( 4. Drag the cursor over the cells you would like Excel to add up 5. Close the bracket by entering in the closed bracket sign ) 6. Hit enter and it will have added up all the remaining scores 7. Then in another cell, click equals, click on the total score you just calculated, type in the divide by sign / and 9, then hit enter and you will get the average score. It is useful to keep a record of your data changes using a table such as the one below. ID #

Item

e.g. 123

K10_7

Average of other K10 variables for the case 24/9 = 2.7

Do another sweep (sort) of all the K10 variables to check that there are no missing data in the cells.

Page 8 of 9

Introduction to Evidence for Health Science Practitioners

CD-RISC-10 We can treat the CD-RISC-10 data in exactly the same way as the K10 by systematically checking each column and inputting the average of the remaining items for that individual. This table can be used to keep a record of any data changes. ID #

Item

e.g. 123

CD-RISC-10_5

Average of other CD-RISC variables for the case 19/9 = 2.1

5. Sav e your work

References Campbell-Sills, L. & Stein, M.B. (2007). Psychometric analysis and refinement of the Connor–Davidson Resilience Scale (CD-RISC): Validation of a 10-Item measure of resilience. Journal of Traumatic Stress, 20(6), 1019-1028. doi: 10.1002/jts.20271 healthdirect (2020). Resting heart rate. Retrieved from https://www.healthdirect.gov.au/resting-heartate#:~:text=For%20adults%2C%20a%20normal%20resting,around%2040%20 beats%20a %20minute Kessler, R. C., Andrews, G., Colpe, L. J., Hiripi, E., Mroczek, D. K., Normand, S. L., . . . Zaslavsky, A. M. (2002). Short screening scales to monitor population prevalences and trends in non-specific psychological distress. Psychological Medicine, 32(6), 959-976. doi: 10.1017}S0033291702006074

Page 9 of 9...


Similar Free PDFs