If test results remain consistent when researchers repeat a study, its reliability gives it value to the field of psychology and to other areas in which it has relevance, such as education or business. Low reliability alerts researchers that they should change certain aspects of the current test or study, or conduct a new one, to improve its value.
Here are a few methods researchers use to assess the reliability of their studies and tests.

Internal reliability refers to how well a resource maintains consistency within itself. To measure internal reliability, which applies specifically to tests, researchers often use the split-half method.
This process involves dividing a test in half before administering it to a participant and comparing the results of each half. If a researcher finds that each portion of the test yields similar results, the test then has internal reliability.
Researchers can divide a test in half using several methods, such as separating the first and second halves, grouping random questions, or separating even- and odd-numbered questions. Example: Smith created an exam on a specific psychological concept for his college students, with every question addressing the same topic. To assess the exam's internal reliability, he split it into two halves and compared each student's score on the first half with that student's score on the second half.
Students' scores on the two halves were similar, confirming the exam's internal reliability. External reliability is the ability of a test to yield the same results both over time and across the individuals who take it. It involves two methods: test-retest and inter-rater. Test-retest reliability measures how well a test's results remain stable across repeated administrations; if scores remain stable over time, the test maintains its reliability.
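As a minimal sketch of how the test-retest check might be quantified, the snippet below correlates scores from two administrations of the same test; the participant scores and the .80 cutoff are invented for illustration, not taken from the source.

```python
import numpy as np

# Hypothetical scores for eight people who took the same test twice,
# two weeks apart (all values invented for illustration).
time1 = np.array([24, 31, 18, 27, 22, 35, 29, 20])
time2 = np.array([26, 30, 17, 28, 21, 33, 30, 22])

# Test-retest reliability is commonly reported as the Pearson
# correlation between the two administrations.
r = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest correlation: r = {r:.2f}")

# A common rule of thumb treats r of about .80 or higher as good stability.
print("Stable over time" if r >= 0.80 else "Unstable over time")
```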
Inter-rater reliability, known as inter-observer reliability when measuring the reliability of research studies, tests whether different raters or observers record the same data when following the protocol of a given test or study. Example: Betty, Ron, and Jane are gymnastics judges. Because opinions about gymnasts' performances vary, they use a standardized scoring system to ensure that they decide on scores using the same protocol.
If the system shows that the judges apply and interpret the scoring criteria in the same way, the scoring system has inter-rater reliability. Researchers use the results of these assessments to improve the reliability of their tests and studies.

A test requires a defined measurement technique in order to assess its reliability. One approach is to look at a split-half correlation.
This involves splitting the items into two sets, such as the first and second halves of the items or the even- and odd-numbered items. A score is then computed for each set of items, and the relationship between the two sets of scores is examined. Note that there are 126 ways to split a set of 10 items into two sets of five, so any single split-half correlation is only one of many possible estimates of a test's internal consistency.
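Here is a minimal sketch of the even/odd split-half procedure in Python, using simulated item responses; the Spearman-Brown correction at the end, a standard step for estimating full-test reliability from a half-test correlation, is an addition not mentioned in the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated 1-5 ratings from 50 participants on a 10-item scale,
# generated so that the items share a common underlying trait.
trait = rng.normal(size=(50, 1))
items = np.clip(np.rint(3 + trait + rng.normal(scale=0.8, size=(50, 10))), 1, 5)

# Split the items into odd- and even-numbered sets and score each half.
odd_half = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7, 9
even_half = items[:, 1::2].sum(axis=1)  # items 2, 4, 6, 8, 10

# Examine the relationship between the two sets of scores.
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction: estimates the reliability of the
# full-length test from the correlation between its two halves.
r_full = 2 * r_half / (1 + r_half)

print(f"Split-half r = {r_half:.2f}, corrected = {r_full:.2f}")
```

A corrected value of about .80 or higher is conventionally read as good internal consistency.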
Many behavioral measures involve significant judgment on the part of an observer or a rater. Inter-rater reliability is the extent to which different observers are consistent in their judgments.
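Returning to the earlier gymnastics-judges example, the sketch below computes one simple index of inter-rater reliability, the average pairwise correlation between raters; the judges' scores are invented, and fuller treatments would use intraclass correlations or, for categorical codes, Cohen's kappa.

```python
import numpy as np
from itertools import combinations

# Invented scores that three judges gave the same ten gymnasts.
scores = {
    "Betty": np.array([9.1, 8.4, 7.9, 9.5, 8.8, 7.2, 9.0, 8.1, 8.6, 7.5]),
    "Ron":   np.array([9.0, 8.6, 8.0, 9.4, 8.7, 7.4, 8.9, 8.3, 8.5, 7.6]),
    "Jane":  np.array([9.2, 8.3, 7.8, 9.6, 8.9, 7.1, 9.1, 8.0, 8.7, 7.4]),
}

# Average pairwise Pearson correlation as a simple consistency index.
pairs = list(combinations(scores, 2))
rs = [np.corrcoef(scores[a], scores[b])[0, 1] for a, b in pairs]
for (a, b), r in zip(pairs, rs):
    print(f"{a} vs {b}: r = {r:.2f}")
print(f"Mean pairwise r = {np.mean(rs):.2f}")
```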
Validity is the extent to which the scores from a measure represent the variable they are intended to measure. But how do researchers make this judgment? We have already considered one factor that they take into account: reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. Imagine, for example, assessing people's self-esteem by measuring the length of their index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity.
Here we consider three basic kinds: face validity, content validity, and criterion validity. Face validity is the extent to which a measurement method appears, on its face, to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities.
So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity.
Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally. Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. It is also the case that many established measures in psychology work quite well despite lacking face validity.
The Minnesota Multiphasic Personality Inventory-2 (MMPI-2), for example, measures many personality characteristics and disorders by having people decide whether each of 567 different statements applies to them, even though many of the statements have no obvious relationship to the construct that they measure.
Content validity is the extent to which a measure covers the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then his measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something.
By this conceptual definition, a person has a positive attitude toward exercise to the extent that he or she thinks positive thoughts about exercising, feels good about exercising, and actually exercises.
Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct. Criterion validity is the extent to which people's scores on a measure are correlated with other variables, known as criteria, that one would expect them to be correlated with. For example, scores on a measure of test anxiety should be negatively correlated with performance on an important exam; if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure. A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them.
For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking; scores on it should correlate with people's actual participation in risky activities.
Criteria can also include other measures of the same construct. For example, one would expect new measures of test anxiety or physical risk taking to be positively correlated with existing established measures of the same constructs.
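A minimal sketch of how such criterion correlations might be checked, using simulated data; the variable names, effect sizes, and sample are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Simulated scores on a new test-anxiety measure and on three criteria
# it should theoretically relate to (all values invented).
test_anxiety = rng.normal(50, 10, n)
exam_score = 80 - 0.5 * test_anxiety + rng.normal(0, 5, n)       # expect negative r
general_anxiety = 20 + 0.4 * test_anxiety + rng.normal(0, 4, n)  # expect positive r
established = 30 + 0.8 * test_anxiety + rng.normal(0, 6, n)      # established measure

for name, criterion in [("exam performance", exam_score),
                        ("general anxiety", general_anxiety),
                        ("established measure", established)]:
    r = np.corrcoef(test_anxiety, criterion)[0, 1]
    print(f"New measure vs {name}: r = {r:+.2f}")

# Criterion validity is supported when the observed pattern matches the
# theoretical one: negative with exam performance, positive with the rest.
```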
Construct validity does not concern the simple, factual question of whether a test measures an attribute. To test for construct validity it must be demonstrated that the phenomenon being measured actually exists. So, the construct validity of a test for intelligence, for example, is dependent on a model or theory of intelligence. Construct validity entails demonstrating the power of such a construct to explain a network of research findings and to predict further relationships.
The more evidence a researcher can demonstrate for a test's construct validity, the better. However, there is no single method of determining the construct validity of a test.
Instead, different methods and approaches are combined to present the overall construct validity of a test. For example, factor analysis and correlational methods can be used.
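As one illustration of what factor analysis and correlational methods can look like in practice, the sketch below inspects the item correlation matrix and the eigenvalues of that matrix (a simplified principal-components stand-in for a full factor analysis), using simulated data built to contain two constructs; everything here is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated responses to six items: items 1-3 are built to tap one
# construct and items 4-6 another (all data invented).
f1 = rng.normal(size=(200, 1))
f2 = rng.normal(size=(200, 1))
items = np.hstack([
    f1 + rng.normal(scale=0.5, size=(200, 3)),
    f2 + rng.normal(scale=0.5, size=(200, 3)),
])

# Correlational evidence: items meant to measure the same construct
# should correlate more strongly with each other than with the others.
R = np.corrcoef(items, rowvar=False)
print(np.round(R, 2))

# Two large eigenvalues of the correlation matrix are consistent with
# the hypothesized two-construct structure.
eigenvalues = np.linalg.eigvalsh(R)[::-1]
print("Eigenvalues:", np.round(eigenvalues, 2))
```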
Concurrent validity is the degree to which a test corresponds to an external criterion that is measured at the same time. If the new test is validated by comparison with a currently existing criterion, we have concurrent validity; very often, a new IQ or personality test is compared with an older but similar test already known to have good validity. Predictive validity, by contrast, is the degree to which a test accurately predicts a criterion that will occur in the future. For example, a prediction may be made on the basis of a new intelligence test that high scorers at age 12 will be more likely to obtain university degrees several years later.
If the prediction is borne out, then the test has predictive validity.
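A minimal sketch of how predictive validity might be checked for the intelligence-test example above; the data are simulated, and with a binary criterion the Pearson formula yields a point-biserial correlation.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# Simulated age-12 intelligence scores and whether the same people
# later obtained a university degree (True/False; all data invented).
score_at_12 = rng.normal(100, 15, n)
p_degree = 1 / (1 + np.exp(-(score_at_12 - 100) / 10))  # higher score -> higher chance
degree = rng.random(n) < p_degree

# Point-biserial correlation between earlier scores and the later outcome;
# a clearly positive value would support predictive validity.
r = np.corrcoef(score_at_12, degree.astype(float))[0, 1]
print(f"Age-12 score vs later degree: r = {r:+.2f}")
```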