Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 6812–6818
Marseille, 11–16 May 2020
© European Language Resources Association (ELRA), licensed under CC-BY-NC




Predicting Item Survival for Multiple Choice Questions in a High-stakes Medical Exam

Victoria Yaneva¹, Le An Ha², Peter Baldwin¹, Janet Mee¹
¹ National Board of Medical Examiners, Philadelphia, USA
² Research Institute in Information and Language Retrieval, University of Wolverhampton, UK
{vyaneva, pbaldwin, jmee}@nbme.org; l.a.ha@wlv.ac.uk

Abstract
One of the most resource-intensive problems in the educational testing industry relates to ensuring that newly-developed exam questions
can adequately distinguish between students of high and low ability. The current practice for obtaining this information is the costly
procedure of pretesting: new items are administered to test-takers and then the items that are too easy or too difficult are discarded. This
paper presents the first study towards automatic prediction of an item’s probability to “survive” pretesting (item survival), focusing on
human-produced MCQs for a medical exam. Survival is modeled through a number of linguistic features and embedding types, as well
as features inspired by information retrieval. The approach shows promising first results for this challenging new application and for
modeling the difficulty of expert-knowledge questions.

Keywords: Multiple Choice Questions, Difficulty Prediction, Educational Applications


1. Introduction

Large-scale testing relies on a pool of test questions, which must be replenished, updated, and expanded over time¹. Writing high-quality test questions is challenging, as they must satisfy certain quality standards before they can be used to score examinees. These standards are based on statistical criteria and ensure that: i) items are not too easy or too difficult for the intended examinee population, and ii) the probability of success on each item is positively related to overall examinee performance (Section 3). While the exact thresholds vary, most exam programs have such a requirement. Even when item writers are well-trained and adhere to industry best practices, it has generally not been possible to identify which items will satisfy the various statistical criteria without first obtaining examinee responses through pretesting. Pretesting involves embedding new items within a standard live exam; based on the collected responses, a determination is made about whether or not a given item satisfies conditions i) and ii). Items that meet the criteria are considered to have “survived” pretesting and can later be used to score examinees. The proportion of surviving items varies across programs; however, Brennan (2006) recommends pretesting at least twice the number of items needed.

While necessary, the enterprise of pretesting is costly. Scored items compete with pretest items for exam space, the scarcity of which can create a bottleneck. As a result, it is sometimes not possible to pretest as many new items as needed, and some exam programs may not be able to afford pretesting at all. This problem is expected to grow with advances in automatic question generation (Gierl et al., 2018), where large numbers of new questions are generated but there are no criteria for evaluating their suitability for live use. Conceivably, advance knowledge of an item's probability to survive would allow the available pretesting slots to be used for items that are more likely to pass the thresholds. To address these issues, we present a method for modeling item survival within a large-scale real-world data set of multiple choice questions (MCQs) for a high-stakes medical exam.

Contributions: i) The paper introduces a new practical application area of NLP related to predicting item survival for improving high-stakes exams. ii) The developed models outperform three baselines with a statistically significant difference, including a strong baseline of 113 linguistic features. iii) Owing to the generic nature of the features, the presented approach is generalizable to other MCQ-based exams. iv) We make our code available² at: https://bit.ly/2EaTFNN.

¹ This constant need for new test questions arises as the population of test-takers grows, new topics for exam content are identified, item exposure threatens exam security, etc.
² The questions cannot be released because of test security.
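Viewed as an NLP task, item survival prediction amounts to supervised binary classification over item text. The following minimal sketch illustrates this framing with a generic TF-IDF representation, a logistic regression classifier, and toy data; none of these choices reflect the actual features or models developed in this paper.

```python
# Minimal sketch of item-survival prediction as binary text classification.
# The features, classifier, and toy data are illustrative assumptions only;
# they are not the feature set or models developed in this paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for item text (stem plus answer options) and survival labels:
# 1 = the item survived pretesting, 0 = the item was discarded.
train_texts = [
    "A 55-year-old woman with small cell carcinoma of the lung ... (A) ... (F) ...",
    "A 3-year-old boy is brought to the physician because of ... (A) ... (E) ...",
]
train_labels = [1, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # simple lexical features
    LogisticRegression(),
)
model.fit(train_texts, train_labels)

# Estimated probability that a new, untested item would survive pretesting.
new_item = "A 40-year-old man comes to the emergency department because of ..."
print(model.predict_proba([new_item])[0, 1])
```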
2. Related Work

Predicting item survival from item text is a new application area for NLP and, to the best of our knowledge, there is no prior work investigating this specific issue. The problem is, however, related to the limited available research on predicting question difficulty, with the important difference that predicting survival involves predicting an additional item parameter that captures the relation between the probability of success for the individual item and overall examinee performance (Section 3).

With regard to estimating question difficulty for humans, the majority of studies focus on applying readability metrics to language comprehension tests, where the comprehension questions refer to a given piece of text and, therefore, there is a relationship between the difficulty of the two (Huang et al., 2017; Loukina et al., 2016). For example, Loukina et al. (2016) investigate the extent to which the difficulty of listening items in an English language proficiency test can be predicted from the textual properties of the prompt, using text complexity features (e.g., syntactic complexity, cohesion, academic vocabulary). In another study, Beinborn et al. (2015) rank the suitability and complexity of individual words as candidates for a fill-in-the-blanks test, and this ranking is used to estimate the difficulty of the particular example.
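For illustration, a few surface complexity features of the kind mentioned above can be computed directly from raw text, as in the sketch below; this is a rough approximation, since the cited work relies on much richer syntactic, cohesion, and vocabulary features.

```python
# Rough sketch of surface text-complexity features for difficulty prediction.
# Illustrative only: studies such as Loukina et al. (2016) use far richer
# syntactic, cohesion, and academic-vocabulary features.
import re

def complexity_features(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "type_token_ratio": len({w.lower() for w in words}) / max(len(words), 1),
    }

print(complexity_features(
    "A 55-year-old woman is admitted to the hospital. "
    "Six days after treatment is started, she develops a fever."
))
```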



A 55-year-old woman with small cell carcinoma of the lung is admitted to the hospital to undergo chemotherapy. Six days after treatment is started, she develops a temperature of 38.0°C (100.4°F). Physical examination shows no other abnormalities. Laboratory studies show a leukocyte count of 100/mm³ (5% segmented neutrophils and 95% lymphocytes).
Which of the following is the most appropriate pharmacotherapy to increase this patient's leukocyte count?
(A) Darbepoetin          (B) Dexamethasone
(C) Filgrastim           (D) Interferon alfa
(E) Interleukin-2 (IL-2) (F) Leucovorin

Table 1: An example of a practice item



A slightly different approach to predicting test difficulty is presented in Padó (2017), where each question is manually annotated and labelled with the cognitive activities and knowledge necessary to answer it, based on Bloom's Taxonomy of Educational Objectives (Bloom and others, 1956). The results indicate that questions that are low in Bloom's hierarchy of skills are easier to answer than ones high in the hierarchy. Nadeem and Ostendorf (2017) approach the same problem from the opposite direction: they aim to predict the skills required to solve assessment questions using a convolutional neural network (CNN). The ultimate goal of their experiments is to use annotated data with labels of such skills in order to automatically populate a Q-matrix of skills used in education to determine how questions should be graded (e.g., more points should be awarded for solving questions that require more skill).

Alsubait et al. (2013) show that the difficulty of newly generated questions can be manipulated by changing the similarity between item components, e.g., the distractors and the correct answer, the question and the distractors, the question and the correct answer, etc. This assumption is later used by Ha and Yaneva (2018) in automatic distractor generation for multiple choice questions, where the system can rank distractors based on various similarity metrics.
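As an illustration of such similarity-based ranking, the sketch below orders candidate distractors by their similarity to the correct answer; the embedding function here is a stand-in placeholder, and the cited systems use their own, more sophisticated similarity metrics.

```python
# Sketch of similarity-based distractor ranking in the spirit of Alsubait et
# al. (2013) and Ha and Yaneva (2018). The embedding function is a placeholder
# assumption; the cited systems use their own similarity metrics.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: map text to a vector (e.g., averaged word embeddings)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(50)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_distractors(correct_answer: str, candidates: list) -> list:
    # More similar candidates are assumed to make more plausible, and hence
    # more difficult, distractors.
    key = embed(correct_answer)
    return sorted(candidates, key=lambda c: cosine(embed(c), key), reverse=True)

print(rank_distractors("Filgrastim", ["Leucovorin", "Dexamethasone", "Darbepoetin"]))
```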
In our prior work, we predict MCQ difficulty and mean response times using a large number of linguistic features in addition to embeddings (Ha et al., 2019; Baldwin et al., 2020). The results presented in Ha et al. (2019) show that the proposed approach predicts the difficulty of the questions with a statistically significant improvement over several baselines. As will be seen in Section 4, we use the full list of linguistic features to obtain a strong baseline prediction for item survival. More details on the individual features and their explanations can be found in Section 4.

3. Data

The data comprises 5,918 pretested MCQs from the Clinical Knowledge component of the United States Medical Licensing Examination (USMLE®). An example of a test item is shown in Table 1. The part describing the case is referred to as the stem, and the incorrect answer options are known as distractors. All items tested medical knowledge and were written by experienced item-writers following a set of guidelines stipulating adherence to a standard structure. These guidelines required avoidance of “window dressing” (extraneous material not needed to answer the item), “red herrings” (information designed to mislead the test-taker), and grammatical cues (e.g., correct answers that are longer or more specific than the other options). Item writers had to ensure that the produced items did not have flaws related to various aspects of validity. For example, flaws related to irrelevant difficulty include: stems or options that are overly long or complicated; numeric data not stated consistently; and options whose language or structure is not homogeneous. Flaws related to “testwiseness” include: grammatical cues; a correct answer that is longer, more specific, or more complete than the other options; and a word or phrase that appears both in the stem and in the correct answer. The goal of standardizing items in this manner is to produce items that vary in their difficulty and discriminating power due only to differences in the medical content they assess.

The items were administered within a standard nine-hour exam, and test-takers had no way of knowing that they would not be scored on these items. Each nine-hour exam contained approximately 40 pretest items, and the data was collected by embedding the items in different live exam forms for four consecutive years (2012–2015). On average, each item was answered by 328 examinees (SD = 67.17). Examinees were medical students from accredited³ US and Canadian medical schools taking the exam for the first time as part of a multistep examination sequence required for medical licensure in the US.

To survive, items had to satisfy two criteria:

• A proportion of correct answers between .30 and .95, i.e., the item had to be answered correctly by no fewer than 30% and no more than 95% of test-takers. Within the educational-testing literature, this proportion of correct answers is commonly referred to as a P-value. We adopt this convention here, but care should be taken not to confuse this usage with a p-value indicating statistical significance. The P-value is calculated in the following way:

  P_i = \frac{\sum_{n=1}^{N} U_n}{N},

  where U_n is the scored response to item i for examinee n (1 if correct, 0 otherwise) and N is the number of examinees who answered the item.

³ Accredited by the Liaison Committee on Medical Education (LCME).
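The P-value criterion can be checked directly from a vector of scored responses. The sketch below implements the formula above together with the .30–.95 survival range; the data layout is an assumption for illustration.

```python
# Sketch of the P-value computation and the first survival criterion from
# Section 3. The binary response vector layout is an assumed data format.
import numpy as np

def p_value(responses: np.ndarray) -> float:
    """P_i = (1/N) * sum_n U_n, where U_n = 1 if examinee n answered item i
    correctly and 0 otherwise, over the N examinees who saw the item."""
    return float(responses.mean())

def meets_p_value_criterion(responses: np.ndarray) -> bool:
    # The item survives this criterion when .30 <= P-value <= .95.
    return 0.30 <= p_value(responses) <= 0.95

responses = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # toy responses for one item
print(p_value(responses), meets_p_value_criterion(responses))
```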


