1. Introduction
Four characteristics make a scale a summated rating scale:
1. Contains multiple items
2. Each individual item measures something that has an underlying, quantitative measurement
continuum
3. Each item has no "right" answer
4. Each item is a statement
The summated rating-scale format is often used for several reasons
- It can generate scales with good psychometric properties, meaning that a well-developed
composite assessment scale can have good reliability and validity
- A summated rating scale is relatively cheap and easy to develop
- A well-designed scale is usually quick and easy for respondents to complete and usually does
not lead to complaints from them
Why use multiple-item scales? – reliability, scope, and precision
Many measured characteristics are broad in scope and not easily assessed with a single question.
Three reasons why single yes-or-no questions are insufficient.
1. Unreliable – People don't consistently respond the same way to single items over time
2. Broad scope – many measured characteristics are broad in scope and cannot be assessed with
a single question
3. Imprecise – They are not very accurate because they only allow for two levels of
measurement. People can only be put into two groups, and there's no way to tell the
differences among people within each group
Example:
A frequently studied domain is people's feelings about the government.
To assess feelings, a single question could be asked, such as
Do you like the government? (Yes or No).
Unfortunately, not all people who respond "yes" will have the same strength of feeling.
Some may love the government; others may only slightly like it. Likewise, some people
responding "no" will hate the government, whereas others will merely dislike it. People in
the middle who have ambivalent feelings will be forced to choose either yes or no and will
be counted along with those with strong feelings. Thus, there is inadequate precision for
most purposes.
Unreliability in people's responses over time can arise in several ways
- Ambivalent people (people with mixed feelings) – may respond essentially at random,
depending on the day, their mood, the weather, etc.
- Respondents may make a mistake in their response – they may misread or misunderstand the
question and answer "yes" when they meant "no"
- People’s feelings are not that simple – they may like certain aspects and not others
Multiple items can address all three problems
- Reliability – allowing random errors of measurement to average out. Given 20 items, if a
respondent makes an error on one item, indicating "love it" instead of "hate it," the impact on
the total score (the sum of all items) is quite minimal, as shown in the sketch after this list.
- Scope – The variety of questions enlarges the scope of what is measured. People would respond
to items concerning various aspects of government. They might be asked if they like the
President, the Congress, the Supreme Court, and the Civil Service.
- Precision – the use of more than two response choices
How do you feel about the government?
Love it
Like it
Neither like nor dislike
Dislike it
Hate it
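A minimal sketch (Python; all numbers hypothetical) of the reliability point above: on a 20-item scale scored 1–5 using choices like those shown, one mistaken response barely moves the total score, whereas on a single-item measure the same slip would be the entire score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 20-item scale scored 1-5 ("Hate it" = 1 ... "Love it" = 5).
n_items = 20
responses = rng.integers(3, 6, size=n_items)  # a mildly positive respondent
responses[0] = 5              # the respondent really means "Love it" here

with_slip = responses.copy()
with_slip[0] = 1              # ...but mistakenly marks "Hate it"

print("total without slip:", responses.sum())
print("total with one slip:", with_slip.sum())
# The slip shifts the total by 4 points on a scale that ranges from
# 20 to 100 -- a small fraction of the possible score spread.
```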
What makes a good scale?
- A good summated rating scale is both reliable and valid.
Reliability = assures that a scale can consistently measure something
Reliability will be considered in two ways:
1. Test-retest reliability – means that a scale gives consistent measurements over
time. If what we're measuring doesn't change, each person should get a similar score when
tested again.
2. Internal-consistency reliability – means that multiple items, designed to measure the same
construct, will intercorrelate with one another. In other words, several questions designed to
measure the same thing will correlate with each other (see the sketch below).
*It is possible for a scale to demonstrate only one of these types of reliability. This will be
discussed in Chapters 5 & 6.
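A minimal sketch (Python with NumPy; the data are simulated, not from a real scale) of how the two kinds of reliability just defined can be estimated: Cronbach's alpha for internal consistency, and the correlation between two administrations of the same scale for test-retest reliability.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Internal-consistency reliability for an (n_respondents, k_items) array.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(1)

# Simulated data: each respondent's true score plus item-level noise,
# so items measuring the same construct intercorrelate.
true = rng.normal(0, 1, size=(200, 1))
time1 = true + rng.normal(0, 0.5, size=(200, 10))
time2 = true + rng.normal(0, 0.5, size=(200, 10))  # retest of the same people

print("alpha (internal consistency):", round(cronbach_alpha(time1), 2))

# Test-retest reliability: correlation of total scores across administrations.
r = np.corrcoef(time1.sum(axis=1), time2.sum(axis=1))[0, 1]
print("test-retest correlation:", round(r, 2))
```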
Validity = assures that a scale measures what it is designed to measure
*Will be discussed in Chapter 6
- A good scale has clear, well-written items that each contain a single idea
- A good scale is appropriate to the population of people who will use it (e.g., reading level)
- A good scale is developed with concern for possible biasing factors. Questions that touch on
personal matters might make some people defensive.
Steps of scale construction:
2. Theory of Summated Rating Scales
Classical test theory distinguishes true score from observed score
True score = the expected value that each person has for the trait or thing we're interested in
measuring
Observed score = the score actually derived from the measurement process
The theory behind summated rating scales, like the Work Locus of Control Scale (WLCS), comes
from classical test theory. This theory distinguishes between true and observed scores. True scores
are theoretical values representing a person's actual standing on a trait, while observed scores are the
scores we measure, influenced by random errors.
According to classical test theory:
O = T + E
- O: Observed score
- T: True score
- E: Random error
The idea is that if measurement is perfectly reliable and valid, the observed score would equal the true
score. Errors are assumed to be random and average out over multiple observations.
For summated rating scales, each item is considered an observation of the intended trait. By averaging
or summing these items, errors in measurement are assumed to average out, resulting in a more
accurate estimate of the true score.
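A minimal sketch (Python; hypothetical numbers) of this averaging argument: treating each item as one observation O = T + E, the mean over more items lands closer to the true score because the random errors increasingly cancel.

```python
import numpy as np

rng = np.random.default_rng(2)

true_score = 3.2  # hypothetical true standing on the trait

for k in (1, 5, 20, 100):
    # Each item is one observation: O = T + E, with random error E.
    errors = rng.normal(0, 1.0, size=k)
    observed = true_score + errors
    print(f"{k:>3} items -> mean observed score = {observed.mean():.2f}")
# With more items, the random errors increasingly cancel out,
# so the item mean converges toward the true score of 3.2.
```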
Increasing the number of items in a scale can enhance reliability by compensating for errors in
individual items. However, this doesn't guarantee validity. Even with high reliability, poor items may
not measure the intended trait.
Classical test theory simplifies measurement by assuming only true scores and random errors.
However, it overlooks biases that can affect responses. Bias (B) is a systematic influence on observed
scores that doesn't reflect the true score. Social desirability (SD) is a common bias where individuals
respond in a socially acceptable way, affecting the accuracy of measurements.
The formula can be extended to include bias:
O = T + E + B
- B: Bias
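A minimal sketch (Python; hypothetical numbers) extending the earlier simulation to O = T + E + B: random error averages out over many items, but a constant bias such as social desirability does not.

```python
import numpy as np

rng = np.random.default_rng(3)

true_score = 3.2  # hypothetical true score
bias = 0.8        # e.g., social desirability pushing every answer upward

# O = T + E + B: the error E is random per item, but the bias B is
# the same systematic shift on every item.
observed = true_score + rng.normal(0, 1.0, size=100) + bias

print("mean observed score:", round(observed.mean(), 2))  # near 4.0, not 3.2
# Averaging many items removes the random error E, but the
# systematic bias B remains in the observed score.
```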