Title: Reliability Notes - Clear cut information
Author: S.... Asif
Course: Research Methodology
Institution: Government College University Faisalabad

RELIABILITY NOTES

Reliability refers to the attribute of consistency in measurement. If the same result can be achieved consistently by using the same methods under the same circumstances, the measurement is considered reliable. For example, if a person weighs themselves several times during the course of a day, they would expect to see a similar reading each time.

A scale that measured weight differently each time would be of little use. The same applies to a tape measure that measured inches differently each time it was used; it would not be considered reliable. If findings from research are replicated consistently, they are reliable. A correlation coefficient can be used to assess the degree of reliability.

There are two types of reliability:
1. Internal reliability
2. External reliability
Internal reliability assesses the consistency of results across items within a test. If a test is internally reliable, its items should show a high positive correlation with one another. External reliability refers to the extent to which a measure varies from one use to another.


Classical Test Theory and the Sources of Measurement Error: The theory of measurement introduced here has been called the classical test theory because it was developed from simple assumptions made by test theorists since the inception of testing. This approach is also called the theory of true and error scores. The basic starting point of the classical theory of measurement is the idea that test scores result from the influence of two factors:
1. Factors that contribute to consistency. These consist entirely of the stable attributes of the individual, which the examiner is trying to measure.
2. Factors that contribute to inconsistency. These include characteristics of the individual, test, or situation that have nothing to do with the attribute being measured, but that nonetheless affect test scores.
This relationship is expressed as

X = T + e



Where X is the obtained score, T is the true score, and e represents errors of measurement. If e is positive, the obtained score X will be higher than the true score T.



Conversely, if e is negative, the obtained score will be lower than the true score.
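The decomposition X = T + e can be illustrated with a short simulation. This is a minimal sketch, not part of the original notes: the true-score distribution, the error standard deviation, and the sample size are arbitrary assumptions chosen only to show that, when errors are random and uncorrelated with true scores, observed-score variance is approximately the sum of true-score variance and error variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true scores for 1,000 examinees (assumed values).
T = rng.normal(loc=50, scale=10, size=1000)

# Random measurement error, independent of the true scores.
e = rng.normal(loc=0, scale=5, size=1000)

# Observed score: X = T + e.
X = T + e

print("Mean error (should be near 0):", e.mean())
print("Corr(T, e) (should be near 0):", np.corrcoef(T, e)[0, 1])
print("Var(X) vs Var(T) + Var(e):", X.var(), T.var() + e.var())
```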



Although it is impossible to eliminate all measurement error, test developers do strive to minimize this psychometric nuisance through careful attention to the sources of measurement error outlined in the following section.

Item Selection: The usual procedure in objective test construction is to prepare a larger pool of test items than will be used in the final form of the instrument. ... When the test is designed to rank subjects on some specified characteristics, item discriminating power is often used as the criterion for selection of items.
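As a hedged illustration of using item discriminating power as a selection criterion, the sketch below computes a corrected item-total correlation for each item (the correlation between an item and the total of the remaining items), which is one common index of discriminating power; items with low values would be candidates for removal. The response matrix is invented for the example.

```python
import numpy as np

# Hypothetical 0/1 responses: rows are examinees, columns are items (assumed data).
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])

n_items = responses.shape[1]
for i in range(n_items):
    rest_total = responses.sum(axis=1) - responses[:, i]   # total score excluding item i
    r = np.corrcoef(responses[:, i], rest_total)[0, 1]     # corrected item-total correlation
    print(f"item {i + 1}: item-rest correlation = {r:.2f}")
```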

Test Administration: Although examiners usually provide an optimal and standardized testing environment, numerous sources of measurement error may nonetheless arise from the circumstances of administration. Examples of general environmental conditions that may exert an untoward influence on the accuracy of measurement include uncomfortable room temperature, dim lighting, and excessive noise.

Test Scoring:


Whenever a psychological test uses a format other than machine-scored multiple-choice items, some degree of judgment is required to assign points to answers. Fortunately, most tests have well-defined criteria for answers to each question. These guidelines help minimize the impact of subjective judgment in scoring (Gregory, 1987).

Systematic Measurement Error: 



The sources of inaccuracy previously discussed are collectively referred to as unsystematic measurement error, meaning that their effects are unpredictable and inconsistent. There is, however, another type of measurement error that constitutes a veritable ghost in the psychometric machine: a systematic measurement error arises when, unknown to the test developer, a test consistently measures something other than the trait for which it was intended.

Measurement Error and Reliability: We can summarize the main features of classical theory as follows (Gulliksen, 1950, chap. 2):
1. Measurement errors are random.
2. The mean error of measurement is zero.
3. True scores and errors are uncorrelated: rTe = 0.
4. Errors on different tests are uncorrelated: r12 = 0.

The Reliability Coefficient: In more precise mathematical terms, the reliability coefficient (rXX) is the ratio of true score variance to the total variance of test scores.
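In symbols (a standard restatement added here for convenience, not verbatim from the notes), this definition can be written as

```latex
r_{XX} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_e}
```

where σ²T is the true-score variance, σ²e the error variance, and σ²X the total observed-score variance; the second equality holds because, under the assumptions above, true scores and errors are uncorrelated.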

The Correlation Coefficient: In its most common application, a correlation coefficient (r) expresses the degree of linear relationship between two sets of scores obtained from the same persons.   

Correlation coefficients can take on values ranging from -1.00 to +1.00. A correlation coefficient of +1.00 signifies a perfect linear relationship between the two sets of scores. In particular, when two measures have a correlation of +1.00, the rank ordering of subjects is identical for both sets of scores.
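For reference, the usual defining formula for Pearson's r over paired scores (x_i, y_i), with means x̄ and ȳ, is the following standard expression (added here; it is not spelled out in the notes):

```latex
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}
```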

The Correlation Coefficient as a Reliability Coefficient:


One use of the correlation coefficient is to gauge the consistency of psychological test scores. If test results are highly consistent, then the scores of persons taking the test on two occasions will be strongly correlated, perhaps even approaching the theoretical upper limit of +1.00. In this context, the correlation coefficient is also a reliability coefficient.

Reliability as Temporal Stability Test–Retest Reliability: The most straightforward method for determining the reliability of test scores is to administer the identical test twice to the same group of heterogeneous and representative subjects.  



If the test is perfectly reliable, each person’s second score will be completely predictable from his or her first score. However, so long as the second score is strongly correlated with the first score, the existence of practice, maturation, or treatment effects does not cast doubt on the test-retest reliability of a psychological test. The test-retest reliability index is simply the zero-order correlation between the test scores at T1 and T2. Inevitably, measurement error comes into play, and scores will vary from T1 to T2: this might be due to random error, such as some participants feeling unwell on the day of testing at T1 and feeling well at T2, or the room being overly warm at T2 compared with T1.
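A minimal sketch of the calculation just described, assuming two made-up score vectors for the same ten people at T1 and T2 (the numbers are illustrative only):

```python
import numpy as np

# Hypothetical scores for the same 10 people at time 1 and time 2 (assumed data).
t1 = np.array([12, 15, 9, 20, 17, 11, 14, 18, 10, 16])
t2 = np.array([13, 14, 10, 19, 18, 12, 13, 17, 11, 15])

# Test-retest reliability = zero-order (Pearson) correlation between T1 and T2 scores.
r_test_retest = np.corrcoef(t1, t2)[0, 1]
print(f"Test-retest reliability: {r_test_retest:.2f}")
```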

Internal Consistency Reliability: Internal consistency reliability is a way to gauge how well a test or survey is actually measuring what you want it to measure. A simple example: you want to find out how satisfied your customers are with the level of customer service they receive at your call center. You send out a survey with three questions designed to measure overall satisfaction. Choices for each question are: Strongly agree / Agree / Neutral / Disagree / Strongly disagree.
1. I was satisfied with my experience.
2. I will probably recommend your company to others.
3. If I write an online review, it would be positive.
If the survey has good internal consistency, respondents should answer each question the same way, i.e. three “agrees” or three “strongly disagrees.” If different answers are given, this is a sign that your questions are poorly worded and are not reliably measuring customer satisfaction. Most researchers prefer to include at least two questions that measure the same thing (the above survey has three). Another example: you give students a math test for number sense and logic.


High internal consistency would tell you that the test is measuring those constructs well. Low internal consistency means that your math test is testing something else (like arithmetic skills) instead of, or in addition to, number sense and logic.

Testing for Internal Consistency: In order to test for internal consistency, you should send out the surveys at the same time; sending the surveys out over different periods of time could introduce confounding variables. An informal way to test for internal consistency is simply to compare the answers to see whether they all agree with each other. In real life, you will likely get a wide variety of answers, making it difficult to see whether internal consistency is good or not. A wide variety of statistical tests are available for internal consistency; one of the most widely used is Cronbach’s Alpha. Other common approaches are:
• Average inter-item correlation: finds the average of all correlations between pairs of questions.
• Split-half reliability: all items that measure the same thing are randomly split into two halves, the two halves are given to a group of people, and the split-half reliability is the correlation between the two sets of scores.
• Kuder-Richardson 20 (KR-20): the higher the Kuder-Richardson score (from 0 to 1), the stronger the relationship between test items. A score of at least .70 is considered good reliability.
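As a brief sketch, the standard KR-20 formula for dichotomously scored (0/1) items is KR-20 = (k/(k-1)) × (1 - Σ p_j q_j / variance of total scores), where p_j is the proportion answering item j correctly and q_j = 1 - p_j. The response matrix below is invented for illustration.

```python
import numpy as np

# Hypothetical 0/1 item responses: rows = examinees, columns = items (assumed data).
X = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1],
])

k = X.shape[1]                           # number of items
p = X.mean(axis=0)                       # proportion correct on each item
q = 1 - p
total_var = X.sum(axis=1).var(ddof=1)    # variance of examinees' total scores

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(f"KR-20 = {kr20:.2f}")
```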

Cronbach’s alpha, α (or coefficient alpha), developed by Lee Cronbach in 1951, measures reliability, or internal consistency. “Reliability” is how well a test measures what it should. For example, a company might give a job satisfaction survey to its employees. High reliability means it measures job satisfaction, while low reliability means it measures something else (or possibly nothing at all). Cronbach’s alpha tests whether multiple-question Likert scale surveys are reliable. These questions measure latent variables: hidden or unobservable variables such as a person’s conscientiousness, neuroticism, or openness. These are very difficult to measure in real life. Cronbach’s alpha will tell you if the test you have designed is accurately measuring the variable of interest.
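A minimal sketch of Cronbach’s alpha for the three-item satisfaction survey described earlier, using the standard formula α = (k/(k-1)) × (1 - Σ item variances / variance of total score). The Likert responses below (1 = strongly disagree … 5 = strongly agree) are invented for illustration.

```python
import numpy as np

# Hypothetical responses of 8 customers to the 3 Likert items (1-5 scale, assumed data).
items = np.array([
    [5, 5, 4],
    [4, 4, 4],
    [2, 3, 2],
    [5, 4, 5],
    [3, 3, 3],
    [1, 2, 1],
    [4, 5, 4],
    [2, 2, 3],
])

k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)        # variance of each item
total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale score

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
```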

Parallel Forms Reliability: Parallel forms reliability (also called equivalent forms reliability) uses one set of questions divided into two equivalent sets (“forms”), where both sets contain questions that measure the same construct, knowledge or skill.


The two sets of questions are given to the same sample of people within a short period of time, and an estimate of reliability is calculated from the two sets. Put simply, you’re trying to find out if test A measures the same thing as test B. In other words, you want to know if test scores stay the same when you use different instruments. Example: you want to find the reliability for a test of mathematics comprehension, so you create a set of 100 questions that measure that construct. You randomly split the questions into two sets of 50 (set A and set B) and administer those questions to the same group of students a week apart. Steps:
Step 1: Give test A to a group of 50 students on a Monday.
Step 2: Give test B to the same group of students that Friday.
Step 3: Correlate the scores from test A and test B.
In order to call the forms “parallel”, the observed scores must have the same means and variances. If the tests are merely different versions (without the “sameness” of observed scores), they are called alternate forms.
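A hedged sketch of the check described in the steps above: correlate the scores on form A and form B for the same students, and compare the means and variances to see whether the forms can be treated as parallel. The score vectors are invented for the example.

```python
import numpy as np

# Hypothetical scores of the same 10 students on form A (Monday) and form B (Friday).
form_a = np.array([42, 38, 45, 30, 36, 48, 33, 40, 29, 44])
form_b = np.array([41, 39, 46, 31, 35, 47, 34, 39, 30, 43])

# Parallel forms reliability estimate: correlation between the two forms.
r_parallel = np.corrcoef(form_a, form_b)[0, 1]

print(f"Form A: mean = {form_a.mean():.1f}, variance = {form_a.var(ddof=1):.1f}")
print(f"Form B: mean = {form_b.mean():.1f}, variance = {form_b.var(ddof=1):.1f}")
print(f"Parallel forms reliability: {r_parallel:.2f}")
```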

Similarity to Split-Half Reliability: Parallel forms and split-half reliability are similar, but with parallel forms, the same students take test A and then take test B. With split-half reliability, one group of students is split into two and both groups sit the test at the same time. The two tests in parallel forms reliability are equivalent and can be used independently of each other.

Advantages and Disadvantages
Advantages:
• Parallel forms reliability can avoid some problems inherent in test-retesting.
Disadvantages:
• You have to create a large number of questions that measure the same construct.
• Proving that the two test versions are equivalent (parallel) can be a challenge.

Split-Half Reliability: In split-half reliability, a test for a single knowledge area is split into two parts, and both parts are given to one group of students at the same time. The scores from both parts of the test are correlated. A reliable test will have a high correlation, indicating that a student would perform equally well (or as poorly) on both halves of the test. Split-half testing is a measure of internal consistency, i.e. how well the test components contribute to the construct that’s being measured. It is most commonly used for multiple choice tests, but you can theoretically use it for any type of test, even tests with essay questions. Steps:


Step 1: Administer the test to a large group of students (ideally, over about 30).
Step 2: Randomly divide the test questions into two parts. For example, separate even questions from odd questions.
Step 3: Score each half of the test for each student.
Step 4: Find the correlation coefficient for the two halves (see the correlation coefficient section above).
Drawbacks: One drawback with this method is that it only works for a large set of questions (a 100-point test is recommended) which all measure the same construct or area of knowledge. For example, a personality inventory that measures introversion, extroversion, depression, and a variety of other personality traits is not a good candidate for split-half testing.
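A minimal sketch of the procedure above, assuming a made-up matrix of item scores: it splits the items into odd and even halves, sums each half for every student, and correlates the two half scores (the notes define split-half reliability as this correlation).

```python
import numpy as np

# Hypothetical scores of 8 students on a 10-item test (0/1 per item, assumed data).
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 1, 0, 1],
])

# Split the items into two halves, e.g. odd-numbered vs even-numbered questions.
odd_half = items[:, 0::2].sum(axis=1)    # items 1, 3, 5, 7, 9
even_half = items[:, 1::2].sum(axis=1)   # items 2, 4, 6, 8, 10

# Split-half reliability: correlation between the two half scores.
r_split_half = np.corrcoef(odd_half, even_half)[0, 1]
print(f"Split-half reliability: {r_split_half:.2f}")
```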

Difference with Parallel Forms: Split-half reliability is similar to parallel forms reliability, which uses one set of questions divided into two equivalent sets. The sets are given to the same students, usually within a short time frame, like one set of test questions on Monday and another set on Friday. With split-half reliability, the two halves are given to one group of students who sit the test at the same time. Another difference: the two tests in parallel forms reliability are equivalent and are independent of each other. This is not true with split-half reliability; the two sets do not have to be equivalent (“parallel”).

Test-Retest Reliability / Repeatability: Test-retest reliability (sometimes called retest reliability) measures test consistency, i.e. the reliability of a test measured over time. In other words, give the same test twice to the same people at different times to see if the scores are the same. For example, test on Monday, then again the following Monday. The two scores are then correlated. Bias is a known problem with this type of reliability test, due to:
• feedback between tests, and
• participants gaining knowledge about the purpose of the test, so they are more prepared the second time around.
This type of reliability testing can also take a long time to complete: depending upon the length of time between the two tests, it could be months or even years.

Calculating Test-Retest Reliability Coefficients: Finding a correlation coefficient for the two sets of data is one of the most common ways to find a correlation between the two tests. Test-retest reliability coefficients (also called coefficients of stability) vary between 0 and 1, where:
• 1: perfect reliability
• ≥ 0.9: excellent reliability
• 0.8 ≤ r < 0.9: good reliability
• 0.7 ≤ r < 0.8: acceptable reliability
• 0.6 ≤ r < 0.7: questionable reliability
• 0.5 ≤ r < 0.6: poor reliability
• < 0.5: unacceptable reliability
• 0: no reliability
On this scale, a correlation of .9 (90%) would indicate a very high correlation (good reliability) and a value of .1 (10%) a very low one (poor reliability). For measuring reliability between two tests, use the Pearson correlation coefficient. One disadvantage: it overestimates the true relationship for small samples (under 15). If you have more than two tests, use the intraclass correlation. This can also be used for two tests, and it has the advantage that it doesn’t overestimate relationships for small samples. However, it is more challenging to calculate, compared to the simplicity of Pearson’s.

Inter-rater Reliability (IRR): Inter-rater reliability is the level of agreement between raters or judges. If everyone agrees, IRR is 1 (or 100%), and if everyone disagrees, IRR is 0 (0%). Several methods exist for calculating IRR, from the simple (e.g. percent agreement) to the more complex (e.g. Cohen’s Kappa). Which one you choose largely depends on what type of data you have and how many raters are in your model.

Inter-Rater Reliability Methods
1. Percent Agreement for Two Raters: The basic measure for inter-rater reliability is the percent agreement between raters. Suppose that, in a competition, the judges agreed on 3 out of 5 scores; the percent agreement is 3/5 = 60%. To find the percent agreement for two raters, a table of their ratings is helpful (the example table is not reproduced here):
i. Count the number of ratings in agreement. In the example, that’s 3.
ii. Count the total number of ratings. For this example, that’s 5.
iii. Divide the number in agreement by the total to get a fraction: 3/5.
iv. Convert to a percentage: 3/5 = 60%.
The field you are working in will determine the acceptable agreement level. If it’s a sports competition, you might accept a 60% rater agreement to decide a winner. However, if you’re looking at data from cancer specialists deciding on a course of treatment, you’ll want a much higher agreement, above 90%. In general, above 75% is considered acceptable for most fields.
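A minimal sketch of the two-rater calculation above, using made-up ratings chosen so that the judges agree on 3 of the 5 scores:

```python
# Hypothetical scores given by two judges to 5 contestants (assumed data,
# chosen so the judges agree on 3 of the 5 ratings).
judge_a = [7, 5, 9, 6, 8]
judge_b = [7, 6, 9, 6, 5]

agreements = sum(a == b for a, b in zip(judge_a, judge_b))   # ratings in agreement (3)
percent_agreement = agreements / len(judge_a)                # 3 / 5

print(f"Percent agreement: {percent_agreement:.0%}")          # 60%
```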

Percent Agreement for Multiple Raters: If you have multiple raters, calculate the percent agreement as follows. Step 1: Make a table of your ratings; for this example, there are three judges rating five contestants (the example table and intermediate steps are not reproduced here; for each contestant, they record the fraction of judge pairs that agree in an Agreement column). Step 5: Find the mean of the fractions in the Agreement column. Mean = (3/3 + 0/3 + 3/3 + 1/3 + 1/3) / 5 = 0.53, or 53%. The inter-rater reliability for this example is 53%. (A worked sketch of this calculation follows the note on disadvantages below.) Disadvantages: As you can probably tell, calculating percent agreement for more than a handful of raters can quickly become cumbersome.


For example, if you had 6 judges, you would have 15 combinations of pairs to calculate for each contestant (use a combinations calculator to figure out how many pairs you would get for multiple judges). A major flaw with this type of inter-rater reliability is that it doesn’t take chance agreement into account and overestimates the level of agreement. This is the main reason why percent agreement shouldn’t be used for academic work (i.e. dissertations or academic publications).
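The sketch referenced above reproduces the three-judge example: for each contestant it computes the fraction of judge pairs that agree and then averages those fractions. The ratings are made up so that the per-contestant fractions match those quoted in the notes (3/3, 0/3, 3/3, 1/3, 1/3).

```python
from itertools import combinations

# Hypothetical ratings: rows = contestants, columns = three judges (assumed data).
ratings = [
    [4, 4, 4],   # all three judges agree -> 3/3
    [3, 4, 5],   # no pair agrees         -> 0/3
    [2, 2, 2],   # all three agree        -> 3/3
    [5, 5, 4],   # one pair agrees        -> 1/3
    [3, 3, 2],   # one pair agrees        -> 1/3
]

fractions = []
for row in ratings:
    pairs = list(combinations(row, 2))         # all judge pairs (3 pairs for 3 judges)
    agree = sum(a == b for a, b in pairs)      # pairs that gave the same score
    fractions.append(agree / len(pairs))

mean_agreement = sum(fractions) / len(fractions)
print(f"Inter-rater reliability (percent agreement): {mean_agreement:.0%}")   # about 53%
```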

Alternative Methods: Several methods have been developed that are easier to compute (usually they are built into statistical software packages) and take chance into account:
• If you have one or two meaningful pairs, use interclass correlation (equivalent to the Pearson correlation coefficient).
• If you have more than a couple of pairs, use intraclass correlation. This is one of the most popular IRR methods and is used for two or more raters.
• Cohen’s Kappa: commonly used for categorical variables.
• Fleiss’ Kappa: similar to Cohen’s Kappa, suitable when you have a constant number of m raters randomly sampled from a population of raters, with a different sample of m coders rating each subject.
• Gwet’s AC2 coefficient is calculated easily in Excel with the AgreeStats add-on.
• Krippendorff’s Alpha is arguably the best measure of inter-rater reliability, but it is computationally complex.
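As a hedged illustration of Cohen’s Kappa for two raters and categorical labels, the sketch below applies the standard formula κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance from the raters’ marginal proportions. The two raters’ labels are invented for the example.

```python
from collections import Counter

# Hypothetical categorical ratings from two raters on 10 subjects (assumed data).
rater_1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes"]
rater_2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]

n = len(rater_1)

# Observed agreement: proportion of subjects given the same label by both raters.
p_o = sum(a == b for a, b in zip(rater_1, rater_2)) / n

# Chance agreement: for each category, the product of the two raters' marginal proportions.
counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
categories = set(rater_1) | set(rater_2)
p_e = sum((counts_1[c] / n) * (counts_2[c] / n) for c in categories)

kappa = (p_o - p_e) / (1 - p_e)
print(f"Cohen's kappa = {kappa:.2f}")
```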
