Summary Test Construction
Book
Mellenbergh, G. J. (2011). A conceptual introduction to psychometrics. Den Haag: Eleven
International Publishing. ISBN: 9789490947293
Content
Chapter 1 – Introduction
Chapter 2 – Developing maximum performance tests
Chapter 3 – Developing typical performance tests
Chapter 4 – Observed test scores
Chapter 5 (appendix excluded) – Classical analysis of observed test scores
Chapter 6 (6.1, 6.1.1, 6.1.2, 6.2, and 6.2.1) – Classical analysis of item scores
Chapter 7 – Principles of item response theory
Chapter 8 (appendix excluded) – Examples of models for dichotomous item responses
Chapter 10 (10.1, 10.1.1, 10.1.2, 10.2, 10.2.1, and 10.2.3) – IRT-based analysis of latent
variables and item responses
Chapter 11 – Test validation
Chapter 12 – Reference points for test score interpretations
Chapter 14 (14.2, 14.3, and 14.3.1) – Examples of IRT-based applications
Chapter 1 – Introduction
1.1 Origins of psychometrics
Three roots of modern testing (DuBois):
- Civil service examinations (ancient China).
o Examine candidates for government positions in six fields: music, archery,
horsemanship, writing, arithmetic, and the rites and ceremonies of public and
private life.
- The assessment of academic achievement.
o Examinees are graded in 4 categories: honor, satisfactory, charity pass, and failure.
o The Jesuit order used written tests for the placement of students and the evaluation
of educational instruction.
- The study of individual differences in behavior.
o Procedures for physical and mental measurement of personal characteristics.
Grades assigned to the same examination paper could vary considerably between examiners, an
observation that lies at the basis of the concepts of measurement error and reliability.
1.2 Test definitions
Test = a psychological or educational test is an instrument for the measurement of a person’s
maximum or typical performance under standardized conditions, where the performance is assumed
to reflect one or more latent attributes.
- A test is defined to measure a performance.
o A maximum performance test asks the person to do their best to solve one or more
problems. The answers can vary in correctness (correct, partly correct, incorrect).
▪ Maximum performance tests are subdivided according to the type of
maximum performance and to the latent attribute that is measured.
▪ E.g., intelligence and achievement tests.
o A typical performance/response test asks the person to respond to one or more
tasks, where the responses are typical for the person. The responses cannot be
evaluated for correctness, but they typify the person.
▪ Typical performance tests are only subdivided according to the attribute that
is measured.
▪ E.g., personality and attitude tests.
- Performances are measured under standardized conditions.
o The test instructions, materials, and administration procedure are the same across
different test takers and different administration occasions.
o But, standardization depends on the target population.
▪ E.g., Bayley-III-SNA (Special Needs Addition) --> adjustments in material (e.g.,
bigger objects to grab), instructions (e.g., no time limit), and procedures.
- The test performance reflects one or more latent attributes.
o Latent attributes are usually called latent variables. They affect test performance.
o The test performance/score is observable, but latent attributes cannot be observed
or measured directly (e.g., verbal ability and depression).
o The test score (X) reflects both the latent attribute of interest (T; the true score) and
measurement error (E), i.e., X = T + E (see the sketch below this list).
▪ There is a causal relationship between the attribute and the test score.
- Tests are distinguished from surveys.
o It is not assumed that survey questions reflect a latent attribute.
▪ Survey questions on sex, civil status, car ownership, etc. are not assumed to
reflect latent attributes.
▪ However, these questions can be used to form a measurement index. An
index of stress can be formed by combining answers to a list of negative life
events (e.g., death, disease, divorce, etc.).
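The short Python sketch below (not from the book; the variable names and the chosen means and standard deviations are illustrative assumptions) simulates the idea that an observed test score combines a latent true score with measurement error, X = T + E, and shows how independent errors across two parallel administrations pull the correlation between observed scores down to the reliability.

```python
import numpy as np

rng = np.random.default_rng(0)
n_persons = 1000

true_score = rng.normal(loc=50, scale=10, size=n_persons)  # latent attribute T (illustrative scale)
error = rng.normal(loc=0, scale=5, size=n_persons)         # measurement error E
observed = true_score + error                              # observed test score X = T + E

# A second, "parallel" administration: same true scores, new independent errors.
observed_retest = true_score + rng.normal(loc=0, scale=5, size=n_persons)

# The correlation between the two parallel score sets approximates the reliability,
# var(T) / (var(T) + var(E)) = 100 / (100 + 25) = 0.8 in this simulation.
print(round(np.corrcoef(observed, observed_retest)[0, 1], 2))
```

Increasing the error standard deviation in the sketch lowers this correlation, which mirrors the intuition that more measurement error means lower reliability.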
Subtest (or subscale/scale) = an independent part of a test.
- It is indicative of an attribute and it consists of various items.
Item = the smallest possible subtest of a test.
- A test consists of n items and is called an n-item test.
Dimensionality = the number of latent attributes that affect test performance (see the sketch after this list).
- Unidimensional test --> a test that predominantly measures one latent attribute (variable).
- Multidimensional test --> a test that measures more than one latent attribute (variable).
- Two-dimensional test --> a test that measures two latent attributes (variables).
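As a small illustration of the item/subtest/dimensionality vocabulary above, the sketch below builds a hypothetical two-dimensional 6-item test from two subtests (the subscale names and item scores are made-up assumptions, not an example from the book).

```python
# Hypothetical 6-item test: items 1-3 form a verbal subtest, items 4-6 a numerical subtest,
# so the full test is two-dimensional while each subtest measures a single latent variable.
item_scores = {"item1": 1, "item2": 0, "item3": 1,   # dichotomous item scores (0/1)
               "item4": 1, "item5": 1, "item6": 0}

subtests = {"verbal": ["item1", "item2", "item3"],
            "numerical": ["item4", "item5", "item6"]}

# Each subtest score is the sum of its item scores; the test score sums all n = 6 items.
subtest_scores = {name: sum(item_scores[i] for i in items) for name, items in subtests.items()}
total_score = sum(item_scores.values())

print(subtest_scores, total_score)  # {'verbal': 2, 'numerical': 2} 4
```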
1.3 Test types
Psychological and educational measurement instruments are divided into mental and physical tests.
- A mental test consists of cognitive tasks, such as problems and questions.
- A physical test consists of instruments to make somatic or physiological measurements, such
as instruments for measuring the galvanic skin response, heart rate, and brain activities.
A performance can be considered maximum in two different respects:
- The performance is accurate (power tests).
o A power test consists of problems that the test taker tries to solve. The test taker has
ample time to work on each of the test items, even on the most difficult items.
▪ The emphasis is on measuring the accuracy with which the problems are solved.
o The time for administering a test is usually limited (time-limited power tests).
- The performance is fast (speed tests).
o A speed test measures how fast the test taker solves problems. Usually, the test
consists of very easy items that can be solved by all the test takers.
▪ The emphasis is on measuring the time taken to solve problems.
Maximum performance tests are also classified according to the attributes which they measure. Two
main types of attributes are distinguished: ability and achievement.
- An ability test (or aptitude test) is an instrument for measuring a person’s best performance
in an area that is not explicitly taught in training and educational programs.
- An achievement test is an instrument for measuring performance that is explicitly taught in
training and educational programs.
A typical performance test is an instrument for measuring behavior that is typical for the person.
Three main types of typical performance tests are distinguished:
- Personality tests measure a person’s personality characteristics, such as neuroticism,
extraversion, dominance, etc.
- Interest inventories measure a person’s interests in teaching, gardening, social service, etc.
- Attitude questionnaires measure a person’s attitude towards civil rights, labor unions,
religion, etc.
Chapter 2 – Developing maximum performance tests
The development of a test starts with the making of a plan. The plan specifies a number of essential
elements of test development:
- The construct of interest.
- The measurement mode of the test.
- The objectives of the test.
- The population and subpopulations where the test should be applied.
- The conceptual framework of the test.
- The response mode of the items.
- The administration mode of the test.
Other essential aspects of test development:
- Item writing guidelines.
- Item rating guidelines.
- Pilot studies on item quality.
- Compiling the first draft of the test.
2.1 The construct of interest
The test developer must define the latent variable of interest that has to be measured by the test.
- ‘Latent variable’ will be used as a general term, while ‘construct’ will be used when a
substantive interpretation of the latent variable is given.
Constructs vary in different ways:
- Constructs vary in content from mental abilities (e.g., intelligence) to psychomotor skills (e.g.,
manual dexterity) and physical abilities (e.g., athletic capacity).
- Constructs vary in scope (e.g., from general intelligence to multiplication skill).
- Constructs vary from educational to psychological variables.
2.2 The measurement mode
Measurement modes of maximum performance tests:
- Self-performance mode = ask test takers to perform a mental or physical task.
o E.g., a student is asked to solve a numerical problem.
- Self-evaluation mode = instead of performing the task, the test taker is asked to evaluate
their ability to perform the task.
o E.g., a student is asked how good they are at solving numerical problems.
- Other-evaluation mode = ask others to evaluate a person’s ability to perform a task.
o E.g., a teacher is asked to assess the students’ numerical ability.
2.3 The objectives
The test developer must specify the objectives of the test.
Some distinctions that are relevant for the planning of the development of a test:
- Distinction between scientific/research (e.g., to study human intellectual functioning) and
practical purposes (e.g., to select job applicants or to assess students’ math achievements).
- Distinction between individual test takers (e.g., accept or reject an applicant for a job) and a
group of test takers (e.g., use mean test scores to compare educational achievements of
students from different countries).
- Distinction between description, diagnosis, and decision-making.
o Description = describe performances.
▪ E.g., therapists may apply tests to better understand their clients.
o Diagnosis = adding a conclusion to a description.