Test construction
Lecture 1
Which test to use for particular aims and for particular groups of interest.
How to evaluate the quality of a test and how to interpret test scores.
Know what the parameters in a formula mean and which formula to use for which computation.
- Test construction: what does a test look like, instructions for administration, administration
procedure, etc.
- Test theory: statistical theory about the behaviour of item scores and test scores, the quality of items,
etc.
Both are needed for sensible use of tests; it is not only about construction.
Tests are broadly used
- Human resource management: personnel selection and development
- Education: individual development and performance of students.
Identify deviating patterns of development (pupil assessment system, mandatory by law).
Prediction of the most suitable type of high school, also mandatory in NL at the end of primary
school (the CITO toets).
- Psychodiagnostics: neuropsychology, clinical psychology, developmental psychology
Mainly used for judgements about individuals.
Tests are also used in research, to test a hypothesis or to build a theory.
In research the focus is mainly on judgements about populations: group level, group comparisons, etc.
What is a test?
A psychological or educational test is an instrument for the measurement of a person’s maximum or
typical performance under standardized conditions, where the performance is assumed to reflect one or
more latent attributes
Typical performance test: typifies a person; there are no correct answers. Used to describe a person,
just the way someone is. Examples: personality, attitude, mental health.
Maximum performance test: a person's achievement; someone needs to do their best. There are correct
and incorrect answers. Examples: intelligence, ability.
Standardization
Test conditions are fixed: conditions should be the same in the application to person A and person B.
This concerns test material, instructions, administration procedure, and score computation. Not only
the instructions are tricky, but also the material: you cannot just rewrite items along the way. Score
computation is nowadays more often computer based and therefore less sensitive to errors, but with
observational instruments, for example, you need very clear scoring rules.
Aim: to ensure comparability of test performances between persons and test occasions.
Perfect standardization is difficult to achieve; therefore, train the test leaders.
Which specific aspects to standardize depends on, for example, the test or the target population.
Latent attribute: a test aims to measure one or more of these, e.g. verbal ability, arithmetic skills,
severity of depression.
An attribute cannot be observed directly; you need indicators for it, so you need a test. The test score
(X) should reflect the latent attribute of interest (T, the true score): a causal relationship between T
and X. This implies that if two people differ in the attribute, their test scores should differ as well,
and the other way around. But there is always measurement error.
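One way to make this concrete is the classical test theory decomposition X = T + E (the decomposition itself is standard; the distributions and numbers below are made up for illustration). A minimal sketch in Python:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical latent true scores T and random measurement error E
T = rng.normal(loc=100, scale=15, size=n)
E = rng.normal(loc=0, scale=5, size=n)

# Observed test scores: X = T + E
X = T + E

# Because of E, X reflects T imperfectly: the correlation is high but not 1
print(np.corrcoef(T, X)[0, 1])

# Share of observed-score variance that is true-score variance (reliability)
print(T.var() / X.var())
```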
Item
The smallest test unit on which a person is scored. The score can be the same as the person's
response. Items can be clustered together in a subtest or subscale: an independent part of a test that is
indicative of an attribute.
Subtest (subscale or scale)
- Independent part of a test
- Indicative of an attribute
- Consists of various items
Example Bayley-III
Aims to assess the developmental level of young children aged 1 to 42 months. Individual
standardized assessment with normed scores. The developmental level is assessed through play, so it
is more of an observational instrument. Used when there are concerns about a child's development, or
to diagnose developmental delays in order to plan or evaluate interventions.
7 subscales, of which 2 belong to the language scale and 2 to the motor scale.
Example of item construction: gross motor
It is described when a child receives points, so rating can be done consistently thanks to the clear
instructions.
Example of item scoring: cognition scoring form
Example of item scores for a number of children on 9 items in SPSS.
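The SPSS file itself is not reproduced here, but the step from item scores to a test score is simple to show; a minimal sketch in Python with made-up scores of 4 children on 9 dichotomous items:

```python
import pandas as pd

# Hypothetical item scores (1 = pass, 0 = fail) of 4 children on 9 items,
# mimicking the kind of data file shown in SPSS
items = [f"item{i}" for i in range(1, 10)]
scores = pd.DataFrame(
    [[1, 1, 1, 0, 1, 0, 0, 0, 0],
     [1, 1, 1, 1, 1, 1, 0, 1, 0],
     [1, 0, 0, 0, 0, 0, 0, 0, 0],
     [1, 1, 1, 1, 1, 1, 1, 1, 1]],
    columns=items,
)

# The simplest test score is the sum of the item scores per child
scores["total"] = scores[items].sum(axis=1)
print(scores)
```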
Test construction process
1 Define construct of interest → 2 develop test → 3 pilot studies for feedback (an intuitive process, so
you can adjust the test and repeat the pilot until you are satisfied) → 4 data collection and analysis →
5 validation and norming. Check the figure for the interaction between the steps.
1) Construct
Abstract and theoretical; requires literature research. There are no gold standards, for intelligence for
example. Construction often begins with people in practice finding that they do not have a proper
enough test for their aim, so they want to construct one. Constructing a new test takes a lot of time
and money.
Important to consider homogeneity (one construct; the indicators fit together) and dimensionality (do
I want to measure 1 construct, or a collection of different constructs that together tell me, for example,
about personality? In that case you can have a multidimensional test).
In personality we do not say that there is 1 personality score. The Big Five is often used: 5
unidimensional constructs, each with different subtests. You cannot combine the 5 and say you have a
high score on personality; this single score does not mean anything.
2) Developing a test
Essential aspects
1. Measurement mode of the test. Do you want someone to demonstrate their performance themselves
(self-performance mode), to fill out a test about themselves (self-evaluation mode), or should someone
else evaluate (other-evaluation mode, e.g. a psychologist)?
2. Objectives of the test
3. Population and subpopulations of testees
4. Conceptual framework of the test
5. Item response mode
6. Administration mode
7. Item writing
Measurement mode of the test
▪ self-performance mode
▪ self-evaluation mode
▪ other-evaluation mode
Example with different modes of administration SDQ
A brief behavioural screening questionnaire about strengths and difficulties that children can
encounter. A very broad and widely used instrument for 3 to 16 year olds. It exists in several versions
to meet the needs of researchers, clinicians and educationalists, and is available in many languages. In
primary school it is sent to the parents.
25 items on psychological attributes, divided over 5 scales. Some scales are more negatively focused,
on difficulties; the more positive strengths subscale is prosocial behaviour. You can work with these 2
test scores or just with the 5 subscale scores.
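A sketch of how these subscale and test scores could relate, assuming the common SDQ scoring in which the four difficulty subscales sum to a total difficulties score while prosocial behaviour stays separate (the item scores are made up):

```python
# Hypothetical SDQ-style item scores of one child: 5 subscales x 5 items,
# each item scored 0-2
subscales = {
    "emotional": [1, 0, 2, 1, 0],
    "conduct": [0, 0, 1, 0, 0],
    "hyperactivity": [2, 1, 2, 1, 1],
    "peer_problems": [0, 1, 0, 0, 0],
    "prosocial": [2, 2, 1, 2, 2],
}

# The 5 subscale scores: sum of the item scores per subscale
subscale_scores = {name: sum(items) for name, items in subscales.items()}
print(subscale_scores)

# The 2 test scores: a difficulties total (prosocial excluded, since it is
# the strengths subscale) and the prosocial score itself
total_difficulties = sum(
    score for name, score in subscale_scores.items() if name != "prosocial"
)
print("total difficulties:", total_difficulties)
print("prosocial:", subscale_scores["prosocial"])
```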
Available as a self-report version or a parent version. One of the main disadvantages of self-report is that
people can inflate their scores to appear more appealing.
Objectives of the test (what is its aim)
- Research or practice
- Individual or group level
- Description vs diagnosis vs decision making (e.g. when to start which treatment, based on scores)
The different choices (i.e. your aim) have consequences for norming and validation.
Population and subpopulations of testees
Be as specific as possible, with inclusion and exclusion criteria. A definition that is too broad has
implications for the norm groups and their representativeness, because you need norms for the entire
population. How are you going to collect data for a population whose definition is vague or too broad?
Think of age range, nationality, etc.
Administration mode
- Oral
- Paper and pencil
- Computerized
- Computerized adaptive test administration
Conceptual framework of the test
More specific than just a definition: it helps to write the items.
Typical performance: three broad classes of strategies
- intuitive: rational, prototypical
- deductive:
- construct method: use of theoretical framework (e.g. Koster et al)
- facet design method: conceptual analysis of the construct; you break it down into smaller and
smaller facets so you can narrow down the construct in the items.
- inductive: constructs to be measured cannot be defined beforehand, but are identified using
association measures (e.g., correlations).
- internal: associations among items
- external: associations between items and external criterion (predictive validity)
The biggest critique is that there is no theoretical foundation for personality tests.
Example of an internal-based strategy
16PF: a self-report instrument measuring 16 primary traits, based on factor analysis of variables
describing a broad range of actual behaviours. Factor analysis creates different clusters, and you look
for labels for them.
Factor analysis: to identify subgroups of variables.
- With high correlations within subgroups
- With low correlations between subgroups
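A minimal sketch of this clustering idea with simulated data (two made-up latent traits and six items; this uses scikit-learn's FactorAnalysis, not the 16PF procedure itself):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n = 500

# Two independent latent traits
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)

def noise():
    return rng.normal(scale=0.5, size=n)

# Six observed variables: items 1-3 driven by trait 1, items 4-6 by trait 2,
# so correlations are high within each subgroup and low between subgroups
X = np.column_stack([
    f1 + noise(), f1 + noise(), f1 + noise(),
    f2 + noise(), f2 + noise(), f2 + noise(),
])

fa = FactorAnalysis(n_components=2).fit(X)

# Loadings: rows are factors, columns are items; the two item clusters
# load on different factors, which is what you would then try to label
print(np.round(fa.components_, 2))
```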
The inductive strategy is a useful approach to describe differences between individuals in personality
characteristics, but it does not reveal the sources or causes of differences in personality.
Response mode
- Many possibilities
- Frequently used scales:
dichotomous = binary
ordinal polytomous: e.g. never/sometimes/often
Item writing
The book describes various concrete guidelines, both for typical and for maximum performance test
items. In general:
- Each item represents one idea
- Be specific
- Use both positively and negatively formulated items
- Avoid expressions and jargon
- Consider the reading level of the user
- Avoid the use of "not"; it is confusing.
Example: "Do you like football?" is not a good question. Is it about watching it, or about playing it?
So make sure people know what you mean, for example through a pilot study.
Pilot study
Check whether instructions and items are clear
Three types of studies
- Expert pilot: concept items are reviewed by experts on the construct you measure
- Test takers pilot: concept items are administered to a small group of test takers from your target
population. Using the target population is critical. It is useful to use a read-aloud protocol (read the
items out loud) or a think-aloud protocol (let them say what they think out loud)
- Raters pilot: yields important information to remove items, remove raters (e.g. those who are bad at
following instructions because they feel sorry for the participant) and/or improve the training of
raters. Focuses on interrater agreement and intrarater consistency.
Measures of agreement
- Interrater agreement: 2 different raters rate the same objects (individuals or items)
- Intrarater consistency: the same rater rates the same objects consistently over multiple occasions
Measures of agreement per scale type
- Nominal and ordinal scales with a small number of categories (dichotomous, trichotomous,
false/correct). Measure of agreement: kappa. Know kappa!!
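Since kappa is flagged as exam material: Cohen's kappa corrects the observed proportion of agreement p_o for the agreement p_e expected by chance, κ = (p_o − p_e) / (1 − p_e). A minimal sketch in Python with made-up dichotomous ratings:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical dichotomous ratings (0/1) by two raters of the same 10 children
rater_a = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
rater_b = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1])

# Observed agreement: proportion of objects on which the raters agree
p_o = np.mean(rater_a == rater_b)

# Chance agreement: per category, the product of the raters' marginal
# proportions, summed over categories
categories = np.union1d(rater_a, rater_b)
p_e = sum(np.mean(rater_a == c) * np.mean(rater_b == c) for c in categories)

kappa = (p_o - p_e) / (1 - p_e)
print(f"p_o = {p_o:.2f}, p_e = {p_e:.2f}, kappa = {kappa:.2f}")

# Should match scikit-learn's implementation
print(cohen_kappa_score(rater_a, rater_b))
```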