Week 1 – Test construction
Measuring instrument → any method that leads to quantitative data
Test → measuring instrument that consists of several components (items, from which a total score is
determined)
Questionnaire → can be a test when a total score is calculated (a test can also measure an attitude,
for example; it is not just about performance)
Subscales/subtests → coherent parts of a test that each consist of several related items
Measuring vs. classification:
Classification is based on a division that is taken as given and is not open to debate
- Aries, Taurus, …
- Schizophrenic, psychotic, …
Measuring is based on a theory that can be tested (e.g. a voltmeter).
- Concepts can change in response to data
- e.g. Mathematics test → language, maths, calculation
Cycle for constructing a test:
COTAN Assessment system:
Basic rule when assessing validity and reliability:
It is only good once it has been proven to be good
- A good for reliability means that the reliability has been properly investigated AND that the
conclusion of this investigation was that the reliability is good
- A fail for reliability means that the reliability was insufficiently investigated AND / OR that it
was investigated and that the conclusion was that the reliability is insufficient
- The same applies analogously to validity
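This basic rule can be written as a small decision function. A minimal Python sketch (the function name and inputs are illustrative, not part of the COTAN system):

```python
def cotan_grade(properly_investigated: bool, conclusion_positive: bool) -> str:
    """Illustrative sketch of the COTAN basic rule for reliability/validity.

    'good' requires that the property was properly investigated AND that the
    conclusion of this investigation was positive; anything else is a fail
    (insufficiently investigated AND/OR conclusion insufficient).
    """
    if properly_investigated and conclusion_positive:
        return "good"
    return "fail"
```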
Phases in a validation study:
- In the construction of a test, a validation study has to be performed to investigate reliability
and validity
1. Preparation:
- Choice of the type of properties that will be measured (for each property, a separate
subscale of several items must be made)
- Exploring the domain with literature and interviews (using existing scales increases the
comparability of your research)
2. Formulation of the items: determining extent to which the items appear to be suitable for
content
- Contents of the individual items (each item only measures the chosen domain, not another
domain)
- Representativeness of the collection of items (items must be a good representation of the
domain)
- Number of items (sufficient items per subscale are needed)
- Precise formulation of the items (items must not be susceptible to multiple interpretations
and must be adapted to language and comprehensibility of target group)
- Content and number of response categories (items that garner the same response from
every test taker are not informative, so use about 5-7 answer categories)
- Expert judgment (ask panel of experts about the quality of items)
3. Planning first administration:
- Use of multiple assessors (inter-rater reliability can be determined)
- Other variables that have to be measured (background variables)
- Number of test subjects that are needed
4. First administration of the scale: data collection takes place
5. Analysis of data of individual items: pre-selection, in which worst items are removed
- Inter-rater reliability of items (whether observers show a high degree of agreement, e.g.
as measured with Cohen's kappa)
- Variance of the items (items with small variance are less suitable because they contribute
little to differentiation; items with variance 0 must be removed)
- Skewness and shape of the distribution of the items (most forms of factor analysis
assume that the items are normally distributed; items that deviate strongly from symmetry
are removed)
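A sketch of this pre-selection step in Python (the response data and cut-offs are made up for illustration; cohen_kappa_score and skew are existing scikit-learn/SciPy functions):

```python
import numpy as np
from scipy.stats import skew
from sklearn.metrics import cohen_kappa_score

# Hypothetical data: 200 subjects answering 10 items on a 5-point scale.
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(200, 10))

# Inter-rater reliability: agreement between two raters on one item.
rater1 = rng.integers(1, 6, size=200)
rater2 = rater1.copy()
rater2[:20] = rng.integers(1, 6, size=20)  # second rater disagrees on some subjects
print("Cohen's kappa:", round(cohen_kappa_score(rater1, rater2), 2))

# Variance per item: variance 0 means removal, and small variance means
# the item contributes little to differentiation.
variances = X.var(axis=0)
keep = variances > 0

# Skewness per item: items that deviate strongly from symmetry are
# removed (the cut-off |skewness| < 1 is an arbitrary example).
skewness = skew(X, axis=0)
keep &= np.abs(skewness) < 1

print("Items kept after pre-selection:", np.where(keep)[0])
```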
6. Analysis of the relationships between items: whether items measure the same
(homogeneity or unidimensionality)
- Correlations between the items (items of the same scale must correlate positively because
they should measure the same property)
- Factor analysis of the items (examines more closely whether the items measure the same aspect;
if the items are not unidimensional, the scale may have to be split into subscales or items
may be removed)
- Analysis of the internal consistency reliability (examines whether the scale has enough
items to view the sum score as a reliable measurement; items with negative contribution
to reliability are removed)
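A minimal NumPy sketch of step 6 (the data are simulated; the alpha formula itself is the standard one):

```python
import numpy as np

def cronbach_alpha(X):
    """Internal consistency reliability of the sum score over the columns of X:
    alpha = k/(k-1) * (1 - sum of item variances / variance of the sum score)."""
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)
    total_var = X.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Simulated scale: 8 items driven by one latent trait plus noise.
rng = np.random.default_rng(1)
theta = rng.normal(size=300)
X = theta[:, None] + rng.normal(size=(300, 8))

# Items of the same scale must correlate positively.
R = np.corrcoef(X, rowvar=False)
print("mean inter-item correlation:", round(R[np.triu_indices_from(R, k=1)].mean(), 2))
print("alpha:", round(cronbach_alpha(X), 2))
```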
7. Standardization: examine what averages, SD and percentiles are for the target groups, in
order to define a norm
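A sketch of how such norms could be derived (the sum scores are simulated; in practice they come from the norm group):

```python
import numpy as np

rng = np.random.default_rng(2)
sum_scores = rng.normal(loc=50, scale=10, size=1000)  # hypothetical norm group

mean, sd = sum_scores.mean(), sum_scores.std(ddof=1)
print(f"mean = {mean:.1f}, SD = {sd:.1f}")
print("P10/P25/P50/P75/P90:", np.percentile(sum_scores, [10, 25, 50, 75, 90]).round(1))

# A new test taker's raw score can then be located in the norm distribution.
print("z-score for a raw score of 63:", round((63 - mean) / sd, 2))
```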
8. Analysis of the relationships between test scores and other variables:
- Test-retest reliability (examines how stable the scores are over time)
- Criterion validation (extent to which test can predict other variables is investigated)
- Construct validation (examines whether the test truly has the theoretically expected
relationships)
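A sketch of these correlational analyses (all data are simulated; pearsonr is an existing SciPy function):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
t1 = rng.normal(size=150)                         # scores at first administration
t2 = t1 + rng.normal(scale=0.4, size=150)         # same subjects, retest
criterion = t1 + rng.normal(scale=0.8, size=150)  # e.g. an external outcome

print("test-retest reliability:", round(pearsonr(t1, t2)[0], 2))
print("criterion validity:", round(pearsonr(t1, criterion)[0], 2))
```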
Construct validity:
Theoretical interpretability
Do we understand what is measured?
Do the theoretical expectations come true?
- If there are no theories about the construct, the test cannot be construct valid! There
should be theories that can be falsified; these should be investigated and should not be
proven wrong.
A dimension is just an aspect that we use to characterize people. We want unidimensional tests
(this can be determined with factor analysis and item response theory).
Unidimensionality:
- Means that the items measure the same thing. This is a component of construct validity,
because if the items are to measure the same construct, they must at least measure the same thing.
The models used to examine unidimensionality have the following assumptions:
- Unidimensionality → each person can be characterized with a single number that indicates
to what extent that person has the property that is intended to be measured. This value is
generally unknown and is therefore called the latent trait. For an intelligence test, this would
be someone's unknown true intelligence. This value is indicated by the Greek letter theta.
- Monotonicity → if the quantity that we want to measure increases, the probability of a
correct or keyed answer increases with the underlying property. For example, when ability or
fear increases, the probability of endorsing an item increases (fear increases → more
vomiting → more wetting of pants)
- Local independence → within a group of subjects with the same value of theta, the items are
not correlated. In the total population, there may be high correlations between the items.
For example: someone who has low aggression should have a low score on all items used (apart
from noise), and if that subject's aggression increases, this should show in all items. Another example is
a written exam. It is desirable that good students have a better chance on all questions. An exam
question on which good students score poorly – that is, a question on which one scores worse the
better one knows the subject matter – is undesirable.
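Monotonicity can be made concrete with a logistic item response function, as used in item response theory (a sketch; the difficulty and discrimination values are made up):

```python
import numpy as np

def p_correct(theta, difficulty, discrimination=1.0):
    """Probability of a correct/keyed answer as a function of the latent trait.
    Monotonicity: this probability increases whenever theta increases."""
    return 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))

theta = np.linspace(-3, 3, 7)
print(p_correct(theta, difficulty=0.0).round(2))
# -> rises monotonically from about 0.05 to about 0.95 as theta increases
```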
Unidimensionality and monotonicity cannot be distinguished empirically. If you were to assume only
unidimensionality, without monotonicity or something similar, it would not be testable (Sijtsma &
Junker, 2006, p. 86). The same applies to local independence. For this reason, the term
unidimensionality is often used for the three assumptions jointly. The word therefore has two
meanings.
What is the difference between unidimensionality and internal consistency reliability?:
Unidimensionality implies that the items measure the same trait, save for noise, but it does not say
anything about the size of the noise component. Internal consistency reliability says something
about the size of the noise in the total score, but does not give an answer to the question whether
the items measure the same trait.
For example: the arithmetic items '3 + 4 = ?' and '5 + 2 = ?' are unidimensional, but their total score is
not reliable because there are only two items. Conversely, the total score of an IQ test usually has a
high reliability, but intelligence is not unidimensional, because there are various types of intelligence
(such as fluid intelligence and crystallized intelligence).
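How strongly internal consistency depends on the number of items can be seen from the standardized alpha formula (the inter-item correlation of 0.3 is an arbitrary example):

```python
def standardized_alpha(k, r_bar):
    """Standardized Cronbach's alpha for k items with mean inter-item correlation r_bar."""
    return k * r_bar / (1 + (k - 1) * r_bar)

print(round(standardized_alpha(2, 0.3), 2))   # two arithmetic items    -> 0.46, low
print(round(standardized_alpha(20, 0.3), 2))  # twenty comparable items -> 0.9, high
```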
- We start with a numerical assumption (there is something that can be measured and that is
not visible, like intelligence), but it has relationships to observations that we can make,
namely the responses that people give to the items of the test.
- We compare the predictions of the model with the data that we gathered, and then we
either accept the model for the data (the quantity exists and we can measure it with the
test), or the predictions are wrong, there is something wrong with the model, and we cannot
measure the thing we want to measure with these items.
Guttman scale:
- The items and the persons can be arranged in such a way that a person answers an item
correctly if the ability (scale value) of the person is larger than the difficulty of the item.
- According to this model: if you fail on item B, you will also fail on the more difficult item C
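A deterministic sketch of this model (the abilities and difficulties are made-up numbers):

```python
import numpy as np

abilities = np.array([0.5, 1.5, 2.5])     # one latent value per person
difficulties = np.array([1.0, 2.0, 3.0])  # items A, B, C in increasing difficulty

# A person answers an item correctly iff their ability exceeds the item difficulty.
responses = (abilities[:, None] > difficulties[None, :]).astype(int)
print(responses)
# [[0 0 0]
#  [1 0 0]
#  [1 1 0]]  -> only "staircase" patterns occur: failing B implies failing C
```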