

Chapter 5: Psychological Measurement

Reliability and Validity of Measurement

Learning Objectives

  1. Define reliability, including the different types and how they are assessed.
  2. Define validity, including the different types and how they are assessed.
  3. Describe the kinds of evidence that would be relevant to assessing the reliability and validity of a particular measure.

Again, measurement involves assigning scores to individuals so that they represent some characteristic of the individuals. But how do researchers know that the scores actually represent the characteristic, especially when it is a construct like intelligence, self-esteem, depression, or working memory capacity? The answer is that they conduct research using the measure to confirm that the scores make sense based on their understanding of the construct being measured. This is an extremely important point. Psychologists do not just assume that their measures work. Instead, they collect data to demonstrate that they work. If their research does not demonstrate that a measure works, they stop using it.

As an informal example, imagine that you have been dieting for a month. Your clothes seem to be fitting more loosely, and several friends have asked if you have lost weight. If at this point your bathroom scale indicated that you had lost 10 pounds, this would make sense and you would continue to use the scale. But if it indicated that you had gained 10 pounds, you would rightly conclude that it was broken and either fix it or get rid of it. In evaluating a measurement method, psychologists consider two general dimensions: reliability and validity.

Reliability

Reliability refers to the consistency of a measure. Psychologists consider three types of consistency: over time (test-retest reliability), across items (internal consistency), and across different researchers (inter-rater reliability).

Test-Retest Reliability

When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time. Test-retest reliability is the extent to which this is actually the case. For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.

Assessing test-retest reliability requires using the measure on a group of people at one time, using it again on the same group of people at a later time, and then looking at the test-retest correlation between the two sets of scores. This is typically done by graphing the data in a scatterplot and calculating Pearson's r. Figure 5.2 shows the correlation between two sets of scores of several university students on the Rosenberg Self-Esteem Scale, administered two times, a week apart. Pearson's r for these data is +.95. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.

[Figure: scatterplot with score at time 1 on the x-axis and score at time 2 on the y-axis, showing fairly consistent scores]
Figure 5.2 Test-Retest Correlation Between Two Sets of Scores of Several University Students on the Rosenberg Self-Esteem Scale, Given Two Times a Week Apart
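For readers who want to try this computation, here is a minimal sketch in Python. The scores are made up for illustration and are not the data behind Figure 5.2:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical Rosenberg Self-Esteem totals for the same ten people,
# measured once and then again one week later.
time1 = np.array([22, 25, 18, 30, 27, 24, 19, 28, 21, 26])
time2 = np.array([21, 26, 17, 29, 28, 23, 20, 27, 22, 25])

r, p = pearsonr(time1, time2)
print(f"Test-retest r = {r:.2f}")  # +.80 or above suggests good reliability
```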

Again, high test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five personality dimensions. But other constructs are not assumed to be stable over time. The very nature of mood, for example, is that it changes. So a measure of mood that produced a low test-retest correlation over a period of a month would not be a cause for concern.

Internal Consistency

A second kind of reliability is internal consistency, which is the consistency of people's responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people's scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people's responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct. This is as true for behavioural and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants' bets were consistently high or low across trials.

Like test-retest reliability, internal consistency can only be assessed by collecting and analyzing data. One approach is to look at a split-half correlation. This involves splitting the items into two sets, such as the first and second halves of the items or the even- and odd-numbered items. Then a score is computed for each set of items, and the relationship between the two sets of scores is examined. For example, Figure 5.3 shows the split-half correlation between several university students' scores on the even-numbered items and their scores on the odd-numbered items of the Rosenberg Self-Esteem Scale. Pearson's r for these data is +.88. A split-half correlation of +.80 or greater is generally considered good internal consistency.

[Figure: scatterplot with score on even-numbered items on the x-axis and score on odd-numbered items on the y-axis, showing fairly consistent scores]
Figure 5.3 Split-Half Correlation Between Several University Students' Scores on the Even-Numbered Items and Their Scores on the Odd-Numbered Items of the Rosenberg Self-Esteem Scale
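The same correlation computation works for a split-half analysis. A minimal sketch, using simulated item responses in place of real scale data (each simulated item reflects a shared latent trait plus noise, so the halves should correlate):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_people, n_items = 30, 10

# Simulate responses driven by one latent trait per person plus item noise.
trait = rng.normal(size=(n_people, 1))
items = trait + rng.normal(scale=0.7, size=(n_people, n_items))

odd_total = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7, 9
even_total = items[:, 1::2].sum(axis=1)  # items 2, 4, 6, 8, 10

r, _ = pearsonr(odd_total, even_total)
print(f"Split-half r = {r:.2f}")  # high, because all items share the same trait
```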

Perhaps the most common measure of internal consistency used by researchers in psychology is a statistic called Cronbach's α (the Greek letter alpha). Conceptually, α is the mean of all possible split-half correlations for a set of items. For example, there are 252 ways to split a set of 10 items into two sets of five. Cronbach's α would be the mean of the 252 split-half correlations. Note that this is not how α is actually computed, but it is a correct way of interpreting the meaning of this statistic. Again, a value of +.80 or greater is generally taken to indicate good internal consistency.
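In practice, α is computed directly from the item variances and the variance of the total scores rather than by averaging split-half correlations. A minimal sketch of the standard formula, applied to hypothetical responses:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha from a respondents-by-items score matrix.

    alpha = k/(k-1) * (1 - sum of item variances / variance of totals),
    using sample variances (ddof=1).
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical responses from five people to a four-item scale (1-5 agreement).
scores = [[4, 5, 4, 4],
          [2, 2, 3, 2],
          [5, 4, 5, 5],
          [3, 3, 2, 3],
          [4, 4, 4, 5]]
print(f"alpha = {cronbach_alpha(scores):.2f}")  # about .94 for these data
```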

Inter-rater Reliability

Many behavioural measures involve significant judgment on the part of an observer or a rater. Inter-rater reliability is the extent to which different observers are consistent in their judgments. For example, if you were interested in measuring university students' social skills, you could make video recordings of them as they interacted with another student whom they are meeting for the first time. Then you could have two or more observers watch the videos and rate each student's level of social skills. To the extent that each participant does in fact have some level of social skills that can be detected by an attentive observer, different observers' ratings should be highly correlated with each other. Inter-rater reliability would also have been measured in Bandura's Bobo doll study. In this case, the observers' ratings of how many acts of aggression a particular child committed while playing with the Bobo doll should have been highly positively correlated. Inter-rater reliability is often assessed using Cronbach's α when the judgments are quantitative or an analogous statistic called Cohen's κ (the Greek letter kappa) when they are categorical.
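A minimal sketch of Cohen's κ for two raters' categorical judgments, with hypothetical codings. The formula is κ = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from each rater's category proportions:

```python
import numpy as np

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical judgments of the same cases."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    p_o = np.mean(a == b)  # observed agreement
    categories = np.union1d(a, b)
    # Chance agreement: product of each rater's marginal proportions per category.
    p_e = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codings of ten video clips as aggressive ("agg") or not ("non").
a = ["agg", "non", "agg", "agg", "non", "non", "agg", "non", "agg", "agg"]
b = ["agg", "non", "agg", "non", "non", "non", "agg", "non", "agg", "agg"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # 0.80 for these codings
```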

Validity

Validity is the extent to which the scores from a measure represent the variable they are intended to. But how do researchers make this judgment? We have already considered one factor that they take into account: reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people's index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people's index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person's index finger is a centimetre longer than another's would indicate nothing about which one had higher self-esteem.

Discussions of validity usually divide it into several distinct "types." But a good way to interpret these types is that they are other kinds of evidence, in addition to reliability, that should be taken into account when judging the validity of a measure. Here we consider three basic kinds: face validity, content validity, and criterion validity.

Face Validity

Face validity is the extent to which a measurement method appears "on its face" to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively, for example by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to, it is usually assessed informally.

Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people's intuitions about human behaviour, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders by having people decide whether each of 567 different statements applies to them, where many of the statements do not have any obvious relationship to the construct that they measure. For example, the items "I enjoy detective or mystery stories" and "The sight of blood doesn't frighten me or make me sick" both measure the suppression of aggression. In this case, it is not the participants' literal answers to these questions that are of interest, but rather whether the pattern of the participants' responses to a series of questions matches those of individuals who tend to suppress their aggression.

Content Validity

Content validity is the extent to which a measure "covers" the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then his measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that he or she thinks positive thoughts about exercising, feels good about exercising, and actually exercises. So to have good content validity, a measure of people's attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.

Criterion Validity

Criterion validity is the extent to which people's scores on a measure are correlated with other variables (known as criteria) that one would expect them to be correlated with. For example, people's scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that people's scores were in fact negatively correlated with their exam performance, then this would be a piece of evidence that these scores really represent people's test anxiety. But if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure.

A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People's scores on this measure should be correlated with their participation in "extreme" activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years. When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity; however, when the criterion is measured at some point in the future (after the construct has been measured), it is referred to as predictive validity (because scores on the measure have "predicted" a future outcome).
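A minimal sketch of checking criterion correlations, with hypothetical scores on a new test-anxiety measure and two criteria (expected directions noted in the comments):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical data for ten people.
test_anxiety = np.array([12, 30, 22, 8, 27, 15, 33, 19, 25, 10])
exam_score = np.array([88, 61, 70, 92, 66, 80, 55, 75, 68, 90])      # expect negative r
general_anxiety = np.array([14, 28, 20, 10, 25, 16, 31, 18, 24, 11]) # expect positive r

for name, criterion in [("exam score", exam_score),
                        ("general anxiety", general_anxiety)]:
    r, _ = pearsonr(test_anxiety, criterion)
    print(f"r with {name}: {r:+.2f}")
```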

Criteria can also include other measures of the same construct. For example, one would expect new measures of test anxiety or physical risk taking to be positively correlated with existing measures of the same constructs. This is known as convergent validity.

Assessing convergent validity requires collecting data using the measure. Researchers John Cacioppo and Richard Petty did this when they created their self-report Need for Cognition Scale to measure how much people value and engage in thinking (Cacioppo & Petty, 1982)[1]. In a series of studies, they showed that people's scores were positively correlated with their scores on a standardized academic achievement test, and that their scores were negatively correlated with their scores on a measure of dogmatism (which represents a tendency toward obedience). In the years since it was created, the Need for Cognition Scale has been used in literally hundreds of studies and has been shown to be correlated with a wide variety of other variables, including the effectiveness of an advertisement, interest in politics, and juror decisions (Petty, Briñol, Loersch, & McCaslin, 2009)[2].

Discriminant Validity

Discriminant validity, on the other hand, is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people's scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead.

When they created the Need for Cognition Scale, Cacioppo and Petty also provided evidence of discriminant validity by showing that people's scores were not correlated with certain other variables. For example, they found only a weak correlation between people's need for cognition and a measure of their cognitive style (the extent to which they tend to think analytically by breaking ideas into smaller parts or holistically in terms of "the big picture"). They also found no correlation between people's need for cognition and measures of their test anxiety and their tendency to respond in socially desirable ways. All these low correlations provide evidence that the measure is reflecting a conceptually distinct construct.
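A minimal sketch of checking convergent and discriminant validity together, with simulated scores standing in for real measures (the variable names and effect sizes here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Simulated scores: a new self-esteem measure, an established self-esteem
# measure (same construct), and a mood measure (conceptually distinct).
new_se = rng.normal(size=n)
established_se = 0.8 * new_se + rng.normal(scale=0.5, size=n)
mood = rng.normal(size=n)

r_convergent = np.corrcoef(new_se, established_se)[0, 1]  # should be high
r_discriminant = np.corrcoef(new_se, mood)[0, 1]          # should be near zero
print(f"convergent r = {r_convergent:+.2f}, discriminant r = {r_discriminant:+.2f}")
```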

Key Takeaways

  • Psychological researchers do not simply assume that their measures work. Instead, they conduct research to show that they work. If they cannot show that they work, they stop using them.
  • There are two distinct criteria by which researchers evaluate their measures: reliability and validity. Reliability is consistency across time (test-retest reliability), across items (internal consistency), and across researchers (inter-rater reliability). Validity is the extent to which the scores actually represent the variable they are intended to.
  • Validity is a judgment based on various types of evidence. The relevant evidence includes the measure's reliability, whether it covers the construct of interest, and whether the scores it produces are correlated with other variables they are expected to be correlated with and not correlated with variables that are conceptually distinct.
  • The reliability and validity of a measure is not established by any single study but by the pattern of results across multiple studies. The assessment of reliability and validity is an ongoing process.
Exercises

  1. Practice: Ask several friends to complete the Rosenberg Self-Esteem Scale. Then assess its internal consistency by making a scatterplot to show the split-half correlation (even- vs. odd-numbered items). Compute Pearson's r as well if you know how.
  2. Discussion: Think back to the last college exam you took and think of the exam as a psychological measure. What construct do you think it was intended to measure? Comment on its face and content validity. What data could you collect to assess its reliability and criterion validity?


Source: https://opentextbc.ca/researchmethods/chapter/reliability-and-validity-of-measurement/