Test construction lecture 1, 3 February
Introduction – developing maximum and typical performance tests (Iris Egberink)
Learning goals
- To know and understand the principles of test and questionnaire construction
- To know how tests and questionnaires for a particular aim and a particular group are
effectively constructed, evaluated and interpreted
Topics
- The process of test construction
- Methods for understanding psychometric properties
- The principles of various item response models and their application in practice
- Important issues of validity and norm-referencing
Exam
- Book, articles, lectures, exercises, practical
- Multiple choice
- Formulae
o Standard statistics by heart (mean, proportion, variance, SD, covariance, correlation,
variance of sum variable)
o Other formulae, if necessary, will be provided at the exam – without names or explanation of
parameters
- Simple, non-programmable calculator
- Example questions on Nestor
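Of the statistics to know by heart, the variance of a sum variable is the one that combines the others: Var(X + Y) = Var(X) + Var(Y) + 2·Cov(X, Y). The sketch below checks this identity numerically in Python. The scores and function names are made up for illustration and are not course material.

```python
def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance (n - 1 in the denominator)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def variance(xs):
    # The variance is the covariance of a variable with itself.
    return covariance(xs, xs)

# Hypothetical scores of five persons on two items (made-up data).
item_1 = [2, 4, 4, 5, 7]
item_2 = [1, 3, 2, 5, 6]
sum_score = [a + b for a, b in zip(item_1, item_2)]

# Variance of the sum computed directly and via the formula.
direct = variance(sum_score)
via_formula = variance(item_1) + variance(item_2) + 2 * covariance(item_1, item_2)
```

The two values agree whichever denominator (n or n - 1) is used, as long as the same one is used for both the variances and the covariance.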
Sincere advice
- Read course manual thoroughly
- Prepare each lecture, study the material after each lecture, use video-lectures if needed
- Be optimally prepared for the first exam – don’t postpone to the resit; be careful with ‘learning
to the test’
Psychological and educational tests
- Test construction – development and application
o What does the test look like?
o Instructions for administration, scoring and interpretation
o Actual administrations of tests
What info does it give?
What is the usefulness of this info, and for whom (individuals, policy)?
- Test theory – statistical theory about behavior of item scores and test scores – What can I do
with the outcome?
o Examples – Classical test theory, item response theory (quality)
o Important issues – quantitative measures for the quality of items and tests for target
groups of respondents
- Both are needed for a sensible use of tests
Use of tests – in practice
1. Human Resource Management – personnel selection and development
2. Education – individual development and performance of students
a. Identify deviating patterns of development (pupil assessment
system/leerlingvolgsysteem)
b. Prediction of most suitable type of high school (end of primary school/CITO-toets
groep 8)
3. Psychodiagnostics – neuropsychology, clinical psychology, developmental psychology
- Judgments on individuals
Use of tests – in research
- Testing of hypotheses and theories; theory building
- E.g. ‘location and size of brain damage determines type and severity of behavioral difficulties
in the long term’
- Variables
o Indicators of location and size of brain damage
o Behavioral difficulties – e.g. anxiety, aggression, childish behavior, apathy, lack of
insight
- Judgments on populations/groups
Definition of a test – a psychological or educational test is an instrument for the measurement of a
person’s maximum or typical performance under standardized conditions, where the performance is
assumed to reflect one or more latent attributes
Test types
- Typical performance test
o Typifies person – no correct answers
o E.g. personality, attitude, mental health
- Maximum performance test
o Person’s achievement
o E.g. intelligence, ability level
Standardization – very important aspect of testing
- Test conditions are fixed
o E.g. test material, instructions, administration procedure, score computing
- Aim – to ensure comparability of test performances between persons and test occasions
- Difficult to achieve perfect standardization – write out the specific instructions to give to the
participants
- Which aspects to standardize depends on, for example, the test or the target population
Latent attribute
- Attribute that cannot be measured directly
o E.g. verbal ability, arithmetic skills, severity of depression
- Test score (X) should reflect the latent attribute of
interest (T; true score)
o Causal relationship between attribute and test score
o Thus, if 2 persons differ on the attribute, the test scores differ as well, and the other
way around
- The test score is an indicator of the attribute
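The relation between X and T can be made concrete with the classical test theory decomposition X = T + E, where E is random measurement error and T is the expected value of X over repeated administrations. This decomposition is standard classical test theory rather than something spelled out in these notes. The numbers below are made up, and the errors are chosen to average exactly zero so that the example is deterministic.

```python
# Hypothetical true scores of two persons (unobservable in practice).
true_scores = {"person_a": 12.0, "person_b": 15.0}

# Made-up measurement errors for four imagined repeated administrations.
# They sum to zero, mimicking random error that averages out in the long run.
errors = [-2.0, 1.0, 2.0, -1.0]

def observed_scores(true_score, errors):
    """Observed score on each administration: X = T + E."""
    return [true_score + e for e in errors]

def mean(xs):
    return sum(xs) / len(xs)

observed = {p: observed_scores(t, errors) for p, t in true_scores.items()}
mean_observed = {p: mean(xs) for p, xs in observed.items()}
```

Averaged over administrations, each person's observed score equals the true score. On a single administration, however, person_a can score 14 while person_b scores 13, which is why a test score is only an indicator of the attribute.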
Some important terminology
- Item
o Smallest test unit, on which person is scored
o Score can be the same as the person’s response
- Subtest (also denoted as subscale, or just scale)
o Independent part of a test
o Indicative of an attribute
o Consists of various items
Example of maximum performance test
- Bayley-III
o Aims to assess the developmental level of young children (1–42 months)
o Individual, standardized assessment
o Normed scores
o Assessing the developmental level by playing
o Aims of use
For children with concerns about development
Diagnosis of developmental delays, in order to plan and/or evaluate
interventions
o Consisting of 5 (or 7) subscales
Administered with child interaction – cognition, language (reception, production), motor (fine, gross)
Parent questionnaires – social-emotional, adaptive behavior
o Example of item instruction – Gross motor
Have the child kick a ball – successful = 1; unsuccessful (falls, does not kick far enough) = 0
Test construction
1. Define the construct of interest
a. Constructs – abstract, theoretical concepts
b. Literature search – what is intelligence? What part of intelligence?
c. Homogeneity and dimensionality – different dimensions could have different
subscales
d. …
2. Develop the test
a. Essential aspects
i. Measurement mode of the test
1. Self-performance mode
2. Self-evaluation mode
3. Other-evaluation mode
4. Example – SDQ
a. Strengths and Difficulties Questionnaire – brief behavioral
screening questionnaire about 3–16 year olds. Exists in several
versions to meet the needs of researchers, clinicians and
educationalists.
b. 25 items on psychological attributes – all versions of the SDQ
ask about 25 attributes, some positive and others negative
c. These 25 items are divided between 5 scales
i. Emotional symptoms – 5
ii. Conduct problems – 5
iii. Hyperactivity/inattention – 5
iv. Peer relationship problems – 5
v. Prosocial behavior – 5
Scales i to iv are added together to generate a total
difficulties score (based on 20 items)
d. Thus, either 2 subscales (total difficulties, prosocial), or 5
subscales
ii. Objectives of the test
1. Research vs. practice
2. Individual or group level
3. Description vs. diagnosis vs. decision making
iii. Population and subpopulation of testees
1. Be as specific as possible
2. Inclusion and exclusion criteria
3. Too broad – implications for norm groups and their
representativeness
iv. Conceptual framework of the test
1. More specific than just definition; it helps to write items
2. Typical performance – three broad classes of strategies
a. Intuitive – rational, prototypical
b. Deductive
i. Construct method – use of theoretical framework
(e.g. Koster et al.)
ii. Facet design method – conceptual analysis of the
construct
c. Inductive – constructs to be measured cannot be defined
beforehand, but are identified using association measures
(e.g. correlations)
i. Internal – associations among items (how they are
related to one another)
ii. External – associations between items and external
criterion (predictive validity)
3. Example of an internal strategy
a. Sixteen Personality Factor Questionnaire (16PF) – Cattell and
colleagues, 1940
b. Self-report measuring 16 primary traits
c. Based on factor analysis of variables describing broad range
of actual behaviors
i. FA (factor analysis) – method to identify subgroups of variables
with high correlations within the subgroups and
low correlations between the subgroups
d. Useful approach to describe differences between individuals
in personality characteristics
i. But it does NOT (and CANNOT) reveal sources
of/causes of differences in personality
v. Item response mode
1. Many, see book
2. Frequently-used scales
a. Dichotomous = binary
i. E.g. yes/no, true/false, correct/incorrect
ii. Typically encoded as 0, 1
b. Ordinal polytomous – e.g. never/sometimes/often
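The SDQ scoring rule described earlier in this list (five scales of five items each, with the first four scales summed into a total difficulties score) can be sketched in Python. Assumptions for illustration only: item scores are already coded numerically (here 0 to 2), positively and negatively worded items have already been keyed in the same direction, and items are grouped consecutively per scale, which is not necessarily how the questionnaire orders them. The function and variable names are hypothetical.

```python
SCALES = [
    "emotional_symptoms",
    "conduct_problems",
    "hyperactivity_inattention",
    "peer_relationship_problems",
    "prosocial_behavior",
]
ITEMS_PER_SCALE = 5

def score_sdq(item_scores):
    """Return the five scale scores plus the total difficulties score.

    item_scores: 25 numeric item scores, assumed to be ordered scale by scale.
    """
    if len(item_scores) != len(SCALES) * ITEMS_PER_SCALE:
        raise ValueError("expected 25 item scores")
    result = {}
    for i, scale in enumerate(SCALES):
        start = i * ITEMS_PER_SCALE
        result[scale] = sum(item_scores[start:start + ITEMS_PER_SCALE])
    # Total difficulties: the first four scales (20 items); prosocial is excluded.
    result["total_difficulties"] = sum(result[s] for s in SCALES[:4])
    return result

# Made-up responses for one child, five items per scale.
responses = [
    1, 0, 2, 1, 0,  # emotional symptoms
    0, 0, 1, 0, 0,  # conduct problems
    2, 2, 1, 2, 1,  # hyperactivity/inattention
    0, 1, 0, 0, 1,  # peer relationship problems
    2, 2, 1, 2, 2,  # prosocial behavior
]
scores = score_sdq(responses)
```

Reporting `total_difficulties` next to `prosocial_behavior` gives the two-subscale reading of the SDQ; reporting the five scale scores gives the five-subscale reading.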