BOOK: MARK LEARY 2012 – INTRODUCTION TO BEHAVIORAL RESEARCH METHODS
CHAPTER 3 – THE MEASUREMENT OF BEHAVIOUR
Behavioural research= the measurement of some behavioural, cognitive, emotional or physiological response.
Types of measures
3 categories:
- Observational measures
- Physiological measures
- Self-reports
Observational measures= the direct observation of behaviour. Can be used to measure anything a participant
does that researchers can observe (e.g., eye contact between people in a conversation). Either observe directly or
make audio/video recordings.
Physiological measures= measures internal processes that are not directly observable (heart rate, brain
activity, etc.) with sophisticated equipment (some processes can be observed with the naked eye, but
specialized equipment is needed to measure them accurately). Use when interested in the relationship
between bodily processes and behaviour.
Self-report measures= the replies people give to questionnaires and interviews. Provide information about
people’s thoughts, feelings or behaviour.
- Cognitive self-reports: measure what people think about something. Ask about attitudes toward a
political issue
- Affective self-reports: measure participants’ responses regarding how they feel. The most straightforward
way to measure emotions is to ask participants to report on them.
- Behavioural self-reports: participants’ reports on how they act. Ask how often participants read the
newspaper, go to church, etc.
Because measurement is so important to the research process, an entire specialty known as psychometrics is
devoted to the study of psychological measurement. Psychometricians investigate the properties of the
measures used in behavioural research and work toward improving psychological measurement.
Case study – Converging operations in measurement
Converging operations or triangulation= because any particular measurement procedure may provide only a
rough and imperfect measure of a given construct, researchers sometimes measure a given construct in several
different ways. When different kinds of measures provide the same results, we have more confidence in their
validity.
Pennebaker, Kiecolt-Glaser and Glaser (1988) research about the effects of writing about an experience on
health.
Hypothesis: people who wrote about traumatic events they had personally experienced would show an
improvement in their physical health.
N=50 university students.
Method: participants wrote for 20 minutes a day, for 4 days, about either a traumatic event they had experienced or about superficial topics.
They obtained:
- Observational measures: participant’s visits to the university health centre
- Physiological measures: functioning of participants’ immune systems (collected samples of
participants’ blood 3 times during the study and tested white blood cells)
- Self-report measures: how distressed participants felt (1 hour, 6 weeks and 3 months after the experiment)
Results: Those who wrote about traumatic experiences visited the health centre less frequently, showed better
functioning of their immune systems and reported they felt better.
Scales of measurement
Goal of measurement= assign numbers to participants’ responses so that they can be summarized and
analysed.
Four levels or scales of measurement:
- Nominal scale= numbers serve merely as labels/descriptions/names; no calculations can be done with them.
Gender (1=men & 2=women), married or not
- Ordinal scale= rank ordering of a set of behaviours or characteristics. Indicate the relative order, but
not the distance between participants on the dimension.
Winner of a race, 2nd finisher of a race, 3rd finisher of a race, etc. The person who finished first (whom
we label 1) is not 1/10th as fast as the person who came in tenth (whom we label 10).
- Interval scale= equal differences between the numbers reflect equal differences between participants.
But no true zero-point.
IQ test, temperature
- Ratio scale= highest level of measurement. With a true zero-point, so real numbers can be added,
subtracted, multiplied and divided.
Weight
Scales of measurement are important to researchers for two reasons:
1. They determine the amount of information provided by a particular measure. Nominal scales usually
provide less information than ordinal, interval, or ratio scales.
In many cases, choice of a measurement scale is determined by the characteristic being measured; it
would be difficult to measure gender on anything other than a nominal scale, for example. However,
given a choice, researchers prefer to use the highest level of measurement scale possible because it will
provide the most pertinent and precise information about participants’ responses or characteristics.
2. The kinds of statistical analyses that can be performed on the data. The more useful and powerful
statistical analyses, such as t-tests and F-tests (which we’ll meet in later chapters), generally require
that numbers be on interval or ratio scales. As a result, researchers try to choose scales that allow
them to use the most informative statistical tests.
Assessing the reliability of a measure
A perfect measure would be one for which the variability in the numbers provided by our measuring technique
perfectly matched the true variability in whatever we are trying to measure. But our measures are never
perfect, so the variability in our data rarely reflects the variability in participants’ responses perfectly.
Since no measure is perfect, how do we know whether a particular measurement technique provides us with
scores that reflect what we want to measure closely enough to be useful in our research? To answer this, we
must look at reliability and validity.
Reliability= consistency or dependability of a measuring technique.
Measurement error
A participant’s score on any measure consists of two components:
- the true score
- measurement error
➔ observed score = true score + measurement error
True score= score that the participant would have obtained if our measure were perfect and we were able to
measure whatever we were measuring without error.
Measurement error= the result of factors that distort the observed score so that it isn’t precisely what it
should be (i.e., it doesn’t perfectly equal the participant’s true score). For example, suppose Susan’s true IQ is
138; if she was anxious and preoccupied when she took the IQ test, her observed IQ score might be lower than 138.
Factors that contribute to measurement error:
- Transient states of the participant. A participant’s mood, health, level of fatigue, and feelings of
anxiety can all contribute to measurement error so that the observed score on some measure does not
perfectly reflect the participant’s true characteristics or reactions.
- Stable attributes of the participant. Paranoid or suspicious participants may purposely distort their
answers; participants’ intelligence and level of motivation can also affect the accuracy of their responses.
Both transient and stable characteristics can produce lower or higher observed scores than participants’ true
scores would be.
- Situational factors in the research setting. Friendly researcher, room temperature, noise, etc.
- Characteristics of the measure. Ambiguous questions, tests that are too long, etc.
- Actual mistakes in recording participants’ responses. Researchers lose count, etc.
The more measurement error present in a measuring technique, the less reliable the measure is. Anything that
increases measurement error decreases the consistency and dependability of the measure.
(Figure: reliability plotted against measurement error (ME); the more ME, the lower the reliability.)
Reliability as systematic variance
For certain kinds of measures, researchers have ways of estimating the reliability of the measures they use. If
they find that a measure is not acceptably reliable, they may take steps to increase its reliability. If the
reliability cannot be increased, they may decide not to use it at all.
Total variance in a set of scores = variance due to true scores (systematic variance) + variance due to
measurement error (error variance)
The variance due to true scores is called systematic variance, because the true-score component is related in a
systematic fashion to the actual attribute that is being measured.
The variance due to measurement error is error variance because it is not related to the attribute being
measured.
Reliability= true score variance / total variance
- .00 = no reliability (the scores reflect nothing but measurement error)
- 1.00 = perfect reliability (all true score variance)
A measure is generally considered reliable when at least 70% of the total variance is systematic, true-score variance (i.e., reliability ≥ .70).
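A minimal simulation sketch (Python with NumPy; the sample size and the true-score and error standard deviations are made-up assumptions, not from the book) showing how observed scores decompose into true score plus measurement error, and how reliability works out as the ratio of true-score variance to total variance:

```python
import numpy as np

rng = np.random.default_rng(42)

n = 1000                                             # hypothetical sample size
true_scores = rng.normal(loc=100, scale=15, size=n)  # true-score SD = 15 (assumed)
error = rng.normal(loc=0, scale=7, size=n)           # measurement-error SD = 7 (assumed)
observed = true_scores + error                       # observed = true + error

# reliability = systematic (true-score) variance / total variance
reliability = true_scores.var() / observed.var()
print(f"estimated reliability: {reliability:.2f}")   # about 15**2/(15**2+7**2) = .82
```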
Three methods to estimate the reliability of measures:
1. Test-retest reliability
2. Interitem reliability
3. Interrater reliability
Correlation coefficient= a statistic that expresses the strength of the relationship between two measures on a
scale from .00 (no relationship between the two measures) to 1.00 (a perfect relationship between the two
measures). Can be positive (direct relationship) or negative (inverse relationship).
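A small illustration of what the correlation coefficient computes (hypothetical scores, Python with NumPy): Pearson’s r is the covariance of the two measures divided by the product of their standard deviations.

```python
import numpy as np

# hypothetical scores of six participants on two measures
x = np.array([4.0, 7.0, 5.0, 8.0, 6.0, 3.0])
y = np.array([4.5, 6.0, 5.5, 7.5, 6.5, 3.5])

# Pearson r = covariance(x, y) / (sd(x) * sd(y))
r = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
print(f"r = {r:+.2f}")  # sign gives the direction, magnitude gives the strength
```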
(1) Test-retest reliability
= the consistency of participants’ responses on a measure over time. Assuming that the characteristic being
measured is relatively stable and does not change over time, participants should obtain approximately the
same score each time they are measured.
Determined by measuring participants on two occasions, usually separated by a few weeks. Then the two sets
of scores are correlated to see how highly the second set of scores correlates to the first. If the two sets of
scores correlate highly (at least .70), the scores must not contain much measurement error, and the measure
has good test–retest reliability.
Only makes sense if the attribute being measured wouldn’t be expected to change between the two
measurements.
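A sketch of how test-retest reliability would be computed in practice (the scores are hypothetical; the .70 criterion is the one given above): the same participants are measured twice and the two sets of scores are correlated.

```python
import numpy as np

# hypothetical scores for the same eight participants, a few weeks apart
time1 = np.array([12, 18, 9, 22, 15, 11, 20, 17])
time2 = np.array([13, 17, 10, 21, 14, 12, 19, 18])

r = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest r = {r:.2f}")
print("acceptable (>= .70)" if r >= .70 else "too much measurement error")
```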
(2) Interitem reliability
= degree of consistency among the items on a scale
Relevant for measures that consist of more than one item (measures that contain multiple items measuring the
same construct are often called scales). Personality inventories, for example, typically consist of several
questions that are summed to provide a single score that reflects the respondent’s extraversion, self-esteem,
shyness, paranoia, or whatever.
When researchers sum participants’ responses to several questions or items to obtain a single score, they must
be sure that all the items are tapping into the same construct.
First: look at the item-total correlation (=the correlation between a particular item and the sum of all other
items on the scale) for each question or item on the scale. Researchers want the item-total correlation between
each item and the sum of the other items to exceed .30. If a particular item does not correlate with the sum of
the other items (i.e., its item-total correlation is low), it must not be tapping into the same “true score” as the
other items.
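A minimal sketch of computing item-total correlations (hypothetical responses of 6 participants to 4 items; the last item is deliberately inconsistent with the others so its item-total correlation comes out below .30):

```python
import numpy as np

# hypothetical responses: 6 participants x 4 items of one scale
items = np.array([
    [4, 5, 4, 2],
    [2, 1, 2, 5],
    [5, 5, 4, 1],
    [3, 3, 3, 3],
    [1, 2, 1, 4],
    [4, 4, 5, 2],
])

# item-total correlation: each item vs. the sum of all OTHER items
for i in range(items.shape[1]):
    rest = np.delete(items, i, axis=1).sum(axis=1)
    r = np.corrcoef(items[:, i], rest)[0, 1]
    flag = "" if r > .30 else "   <- not tapping the same construct"
    print(f"item {i + 1}: item-total r = {r:+.2f}{flag}")
```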
Second: use split-half reliability as an index of interitem reliability, to know how reliable the measure is. Divide
the items on the scale into two sets (first and second halves of the scale, or odd-numbered & even-numbered
items, or randomly), obtain a total score for each set by adding the items within each set, and calculate the
correlation between the two sets.
Correlation of >0.70 = items on the scale measure the same construct. But if the split-half correlation is small,
the two halves of the scale are not measuring the same thing and thus the total score contains a great deal of
measurement error.
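A sketch of split-half reliability using an odd/even split (hypothetical data; the halving and correlating follow the procedure described above):

```python
import numpy as np

# hypothetical responses: 6 participants x 8 items of one scale
items = np.array([
    [4, 5, 4, 4, 5, 4, 5, 4],
    [2, 1, 2, 2, 1, 2, 1, 2],
    [5, 5, 4, 5, 4, 5, 5, 4],
    [3, 3, 3, 2, 3, 3, 2, 3],
    [1, 2, 1, 2, 1, 1, 2, 1],
    [4, 4, 5, 4, 4, 5, 4, 4],
])

# odd/even split: total score per half, then correlate the two halves
odd_half = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7
even_half = items[:, 1::2].sum(axis=1)  # items 2, 4, 6, 8
r = np.corrcoef(odd_half, even_half)[0, 1]
print(f"split-half r = {r:.2f}")        # > .70: halves measure the same construct
```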
Drawback of split-half reliability:
- The reliability coefficient one obtains depends on how the items are split. Using a first-half/second-
half split is likely to provide a slightly different estimate of interitem reliability than an even/odd split.
What, then, is the real interitem reliability? To get around this ambiguity, researchers now use
Cronbach’s alpha coefficient (=equivalent to the average of all possible split-half reliabilities).
Adequate interitem reliability if Cronbach’s alpha >0.70 (=70% of the total variance in participants’
scores on the measure is systematic, true-score variance).
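A sketch of Cronbach’s alpha computed from its standard formula, alpha = (k / (k − 1)) × (1 − sum of item variances / variance of total scores), applied to hypothetical data:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a participants-x-items response matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each separate item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# hypothetical responses: 6 participants x 4 items
items = np.array([
    [4, 5, 4, 4],
    [2, 1, 2, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
])
print(f"alpha = {cronbach_alpha(items):.2f}")   # >= .70: adequate interitem reliability
```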
(3) Interrater reliability / interjudge reliability / interobserver reliability
= consistency among two or more researchers who observe and record participants’ behaviour.
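One common way to index interrater reliability is to correlate the two observers’ recordings; a minimal sketch with hypothetical counts (the eye-contact scenario echoes the observational-measures example earlier in the chapter):

```python
import numpy as np

# hypothetical counts: two observers independently record how often each of
# eight participants makes eye contact during a conversation
rater_a = np.array([3, 7, 5, 2, 9, 4, 6, 1])
rater_b = np.array([4, 7, 5, 3, 8, 4, 6, 2])

r = np.corrcoef(rater_a, rater_b)[0, 1]
print(f"interrater r = {r:.2f}")  # high r: the observers record behaviour consistently
```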