CH. 6 CHARACTERISTICS OF EFFECTIVE SELECTION TECHNIQUES
Reliable, valid, cost-efficient, and legally defensible.

RELIABILITY is the extent to which a score from a selection measure is stable and free from error.
If applicants score differently each time they take a test, we are unsure of their actual scores.
Test reliability is determined in four ways: test-retest reliability, alternate-forms reliability, internal reliability, and scorer reliability.

Test-Retest Reliability Method - several people each take the same test twice.
The scores from the first administration of the test are correlated with scores from the second to determine whether they are similar.
If they are, the test is said to have temporal stability: The test scores are stable across time and not highly susceptible to such random daily conditions as illness, fatigue, stress, or even uncomfortable testing conditions.
There is no standard amount of time; the time interval should be long enough so that the specific test answers have not been memorized, but short enough so that the person has not changed significantly.
Typical time intervals between test administrations range from 3 days to 3 months. Usually, the longer the time interval, the lower the reliability coefficient.

Alternate-Forms Reliability Method - two forms of the same test are constructed.
Half of the sample first receive Form A and the other half Form B. This counterbalancing of test-taking order is designed to eliminate any effects that taking one form of the test first may have on scores on the second form.
The scores on the two forms are then correlated to determine whether they are similar. If they are, the test is said to have form stability.
Two forms of the test are needed to reduce the potential advantage to individuals who take the test a second time. This situation might occur in police department examinations.
The time interval should be as short as possible.
Any changes in a test potentially change its reliability, validity, difficulty, or all three. Such changes might include the order of the items, examples used in the questions, method of administration, and time limits.
Though alternate-form differences potentially affect test outcomes, most research indicates that these effects are either nonexistent or rather small.

Internal Reliability/Consistency - the extent to which similar items are answered in similar ways; measures item stability.
The longer the test, the higher its internal consistency—that is, the agreement among responses to the various test items.
Item homogeneity - do all of the items measure the same thing, or do they measure different constructs? The more homogeneous the items, the higher the internal consistency.
Methods used to determine internal consistency: split-half, coefficient alpha, and K-R 20 (Kuder-Richardson formula 20).
The split-half method is the easiest to use: the items on a test are split into two groups (e.g., odd- and even-numbered items).
o Because the number of items in the test has been reduced, researchers have to use a formula called the Spearman-Brown prophecy to adjust the correlation.
K-R 20 is used for tests containing dichotomous items (e.g., yes/no, true/false).
Coefficient alpha can be used not only for dichotomous items but for tests containing interval and ratio items such as five-point rating scales.
Both are more popular and accurate methods, though more complicated; they represent the reliability coefficient that would be obtained from all possible combinations of split halves.

Scorer Reliability - a test or inventory can have homogeneous items and yield heterogeneous scores and still not be reliable if the person scoring the test makes mistakes.
When human judgment of performance is involved, scorer reliability is discussed in terms of interrater reliability: will two interviewers give an applicant similar ratings, or will two supervisors give an employee similar performance ratings?

Evaluating the Reliability of a Test
When deciding whether a test demonstrates sufficient reliability, two factors must be considered: the magnitude of the reliability coefficient and the people who will be taking the test.
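As a quick illustration of the split-half method and the Spearman-Brown correction described above, here is a minimal Python sketch. The item responses and the odd/even split are invented for illustration only:

```python
# Sketch: split-half reliability with the Spearman-Brown correction.
# The item responses below are invented for illustration.

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(responses):
    """responses: one list per person, one entry per item (1 = yes/correct,
    0 = no/wrong). Split the items into odd- and even-numbered halves,
    correlate the two half-scores, then apply the Spearman-Brown prophecy
    formula to estimate the reliability of the full-length test."""
    odd = [sum(person[0::2]) for person in responses]    # items 1, 3, 5, ...
    even = [sum(person[1::2]) for person in responses]   # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    return (2 * r_half) / (1 + r_half)                   # Spearman-Brown

# Six people answering a ten-item dichotomous test (invented data)
responses = [
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
]
print(round(split_half_reliability(responses), 2))  # → 0.8
```

The same `pearson` helper is what a test-retest or alternate-forms study would apply to the two sets of whole-test scores; only the split-half method needs the Spearman-Brown adjustment, because each half is shorter than the real test.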
The reliability coefficient for a test can be obtained from your own data, the test manual, journal articles using the test, or test compendia.
To evaluate the coefficient, you can compare it with reliability coefficients typically obtained for similar types of tests.
The second factor to consider is the people who will be taking your test. For example, if you will be using the test for managers, but the reliability coefficient in the test manual was established with high school students, you would have less confidence that the reliability coefficient would generalize well to your organization.

VALIDITY is the degree to which inferences from scores on tests or assessments are justified by the evidence.
For example, suppose that we want to use height requirements to hire typists. Height can be measured very reliably, yet it tells us nothing about typing performance: a test's reliability does not imply validity. Instead, we think of reliability as having a necessary but not sufficient relationship with validity.

Content Validity - the extent to which test items sample the content that they are supposed to measure.
The appropriate content for a test or test battery is determined by the job analysis.
The readability of a test is a good example of how tricky content validity can be. Suppose the personality inventory is very difficult to read (e.g., containing such words as meticulous, extraverted, gregarious) and most of our applicants are only high school graduates. Is our test content valid? No, because it requires a high level of reading ability, and reading ability was not identified as an important dimension for our job.

Criterion Validity - the extent to which a test score is related to some measure of job performance called a criterion.
Criterion validity is established using one of two research designs: concurrent or predictive.
With a concurrent validity design, a test is given to a group of employees who are already on the job. The scores on the test are then correlated with a measure of the employees' current performance.
With a predictive validity design, the test is administered to a group of job applicants who are going to be hired. The test scores are then compared with a future measure of job performance.
If every applicant is hired, a wide range of both test scores and employee performance is likely to be found, and the wider the range of scores, the higher the validity coefficient.
Most criterion validity studies use a concurrent design.
Why is a concurrent design weaker than a predictive design? The answer lies in the homogeneity of performance scores: the restricted range of performance scores makes obtaining a significant validity coefficient more difficult.

Validity generalization, or VG - the extent to which a test found valid for a job in one location is valid for the same job in a different location.
With large sample sizes, a test found valid in one location probably will be valid in another, providing that the jobs actually are similar and are not merely two separate jobs sharing the same job title.
The two building blocks for validity generalization are meta-analysis and job analysis. Meta-analysis can be used to determine the average validity of specific types of tests for a variety of jobs.
Validity generalization should be used only if a job analysis has been conducted, the results of which show that the job in question is similar to those used in the meta-analysis.

Construct Validity - the most theoretical of the validity types; the extent to which a test actually measures the construct that it purports to measure.
Construct validity is concerned with inferences about test scores, in contrast to content validity, which is concerned with inferences about test construction.
Construct validity is usually determined by correlating scores on a test with scores from other tests. Some of the other tests measure the same construct, whereas others do not.
Another method of measuring construct validity is known-group validity - a test is given to two groups of people who are "known" to be different on the trait in question.
If the known groups do not differ on test scores, consider the test invalid. If scores do differ, one still cannot be sure of its validity.

CHOOSING A WAY TO MEASURE VALIDITY
Which of the methods is the "best" to use? It depends on the situation as well as what the person conducting the validity study is trying to accomplish.
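The restriction-of-range point above can be shown numerically: when only the higher scorers are on the job, the validity coefficient computed from incumbents shrinks. A minimal Python sketch with invented scores:

```python
# Sketch: why restriction of range shrinks a validity coefficient.
# All scores below are invented for illustration.

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Test scores and later performance ratings for ten applicants,
# as in a predictive design where every applicant is hired.
test = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
perf = [2, 1, 3, 5, 4, 6, 5, 8, 7, 9]

# Full range: the whole applicant pool.
r_full = pearson(test, perf)

# Concurrent-style restricted range: only the current employees
# (here, the people who scored 6 or higher and were hired).
pairs = [(t, p) for t, p in zip(test, perf) if t >= 6]
r_restricted = pearson([t for t, _ in pairs], [p for _, p in pairs])

# The restricted coefficient comes out smaller than the full-range one.
print(round(r_full, 2), round(r_restricted, 2))
```

The relationship between test and performance is identical in both computations; only the range of scores changes, which is exactly why a significant coefficient is harder to obtain in a concurrent design.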
If it is to decide whether the test will be a useful predictor of employee performance, then content validity will usually be used, and a criterion validity study also will be conducted if there are enough employees and if a good measure of job performance is available.
In deciding whether content validity is enough, apply the "next-door neighbor rule":
o "If my next-door neighbor were on a jury and I had to justify the use of my test, would content validity be enough?"
To get a significant validity coefficient, you need a good test, a good measure of performance, and a decent sample size.
A test itself can never be valid. When we speak of validity, we are speaking about the validity of the test scores as they relate to a particular job.
When we say that a test is valid, we mean that it is valid for a particular job and a particular criterion. No test will ever be valid for all jobs and all criteria.

Face validity is the extent to which a test appears to be job related.
This perception is important because if a test or its items do not appear valid, the test-takers and administrators will not have confidence in the results.
Face validity motivates applicants to do well on tests.
Face-valid tests that are accepted by applicants decrease the chance of lawsuits, reduce the number of applicants dropping out of the employment process, and increase the chance that an applicant will accept a job offer.
The face validity and acceptance of test results can be increased by informing the applicants about how a test relates to job performance and by administering the test in a multimedia format.
Barnum statements - statements so general that they can be true of almost everyone.

The Seventeenth Mental Measurements Yearbook (MMY) contains information about thousands of different psychological tests as well as reviews by test experts.

COST-EFFICIENCY
If two or more tests have similar validities, then cost should be considered.
Group testing is usually less expensive and more efficient than individual testing, although important information may be lost in group testing.
Computer-assisted testing can lower testing costs, decrease feedback time, and yield results in which the test-takers can have great confidence.
This increase in efficiency does not come at the cost of decreased validity because, as mentioned previously, tests administered electronically seem to yield results similar to those administered through the traditional paper-and-pencil format.
Computer-adaptive testing (CAT) - the computer "adapts" the next question to be asked on the basis of how the test-taker responded to the previous question or questions.
The logic behind CAT is that if a test-taker can't answer easy questions, it doesn't make sense to ask difficult questions.
The advantages to CAT are that fewer test items are required, tests take less time to complete, finer distinctions in applicant ability can be made, test-takers can receive immediate feedback, and test scores can be interpreted not only on the number of questions answered correctly, but on which questions were correctly answered.

Establishing the Usefulness of a Selection Device
The Taylor-Russell tables provide an estimate of the percentage of total new hires who will be successful employees if a test is adopted (organizational success).
Both expectancy charts and the Lawshe tables provide a probability of success for a particular applicant based on test scores (individual success).
The utility formula provides an estimate of the amount of money an organization will save if it adopts a new testing procedure.

Taylor-Russell tables are designed to estimate the percentage of future employees who will be successful on the job if an organization uses a particular test.
The first information needed is the test's criterion validity coefficient.
o The higher the validity coefficient, the greater the possibility the test will be useful.
The second piece of information that must be obtained is the selection ratio, which is simply the percentage of people an organization must hire.
o The lower the selection ratio, the greater the potential usefulness of the test.
The final piece of information needed is the base rate of current performance - the percentage of employees currently on the job who are considered successful.
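The three Taylor-Russell inputs can be computed from simple counts before looking up the tables. A small Python sketch with invented numbers (the resulting percentage of successful hires would then be read from the published tables, not computed here):

```python
# Sketch: the three inputs to a Taylor-Russell table lookup.
# All counts are invented for illustration.

validity = 0.40            # criterion validity coefficient (from a study or VG)

openings = 10              # people the organization must hire
applicants = 100
selection_ratio = openings / applicants          # lower → more useful test

successful_now = 60        # current employees judged successful
current_employees = 100
base_rate = successful_now / current_employees   # base rate of performance

print(validity, selection_ratio, base_rate)
```

With these three values in hand, one enters the table for the given validity and selection ratio, in the row for the given base rate, to read the expected percentage of successful new hires.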
Proportion of Correct Decisions
Determining the proportion of correct decisions is easier to do but less accurate than the Taylor-Russell tables.
The only information needed to determine the proportion of correct decisions is employee test scores and the scores on the criterion.

Lawshe Tables
Used to know the probability that a particular applicant will be successful.
Three pieces of information are needed: the validity coefficient, the base rate, and the applicant's test score.
o Did the person score in the top 20%, the next 20%, the middle 20%, the next lowest 20%, or the bottom 20%?

Utility Formula
Computing the amount of money an organization would save if it used the test to select employees; used to estimate the monetary savings to an organization.
To use this formula, five items of information must be known.
a) Number of employees hired per year (n).
b) Average tenure (t). This is the average amount of time that employees in the position tend to stay with the company.
c) Test validity (r). This figure is the criterion validity coefficient that was obtained through either a validity study or validity generalization.
d) Standard deviation of performance in dollars (SDy).
e) Mean standardized predictor score of selected applicants (m).

Determining the Fairness of a Test
The term bias or unbiased refers to technical aspects of a test. A test is considered biased if there are group differences (e.g., sex, race, or age) in test scores that are unrelated to the construct being measured.
The term fairness can include bias, but also includes political and social issues. Typically, a test is considered fair if people of equal probability of success on a job have an equal chance of being hired.
Adverse impact - there are two basic ways to determine this: looking at test results or anticipating adverse impact prior to the test.
Adverse impact occurs if the selection rate for any group is less than 80% of the highest scoring group (practical significance) and the difference is statistically significant (statistical significance).
Even if the test has adverse impact, it probably will be considered a legal test.

Single-group validity - the test will significantly predict performance for one group and not others.
To test for single-group validity, separate correlations are computed between the test and the criterion for each group.
If both correlations are significant, the test does not exhibit single-group validity and it passes this fairness hurdle.
If, however, only one of the correlations is significant, the test is considered fair for only that one group.
Single-group validity is rare.

Differential validity - a test is valid for two groups but more valid for one than for the other. Also rare.
Usually in occupations dominated by a single sex, tests are most valid for the dominant sex, and the tests overpredict minority performance.
Remember, with single-group validity, the test is valid only for one group. With differential validity, the test is valid for both groups, but it is more valid for one than for the other.
If a test does not lead to adverse impact, does not have single-group validity, and does not have differential validity, it is considered to be fair.
A test must be valid, have utility, and be fair.

MAKING THE HIRING DECISION
If more than one criterion-valid test is used, the scores on the tests must be combined.
Multiple regression - each test score weighted according to how well it predicts the criterion.

Unadjusted Top-Down Selection
Applicants are rank-ordered on the basis of their test scores. Selection is then made by starting with the highest score and moving down until all openings have been filled.
The advantage to top-down selection is that by hiring the top scorers on a valid test, an organization will gain the most utility.
The disadvantages are that this approach can result in high levels of adverse impact and it reduces an