Challenging the norm: Length of exams determined by classification accuracy or reliability

Abstract

Purpose This paper challenges the notion that reliability indices are appropriate for informing test length in exams in medical education, where the focus is on ensuring defensible pass-fail decisions. Instead, we argue that classification accuracy is better suited to the purpose of exams in these cases. We show empirically, using resampled test data from a range of undergraduate knowledge exams, that this is indeed the case. More specifically, we address the hypothesis that using classification accuracy leads to recommending shorter test lengths than using reliability.

Method We analysed data from previous exams from both pre-clinical and clinical phases of undergraduate medical education. We used a re-sampling procedure in which both the cut-score and the test length of repeatedly generated synthetic exams were varied systematically. In total, N = 52 500 datasets were generated from the original exams. For each of these, both reliability and classification accuracy indices were estimated.
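The re-sampling idea described above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: the response data are simulated with a simple Rasch-type model instead of being drawn from real exams, reliability is estimated with Cronbach's alpha, and classification accuracy is approximated as agreement with the pass-fail decision based on the full item pool. All function names are hypothetical, and the test lengths and cut-scores are illustrative only.

```python
import numpy as np


def cronbach_alpha(responses):
    """Cronbach's alpha for a candidates x items matrix of 0/1 scores."""
    n_items = responses.shape[1]
    item_var_sum = responses.var(axis=0, ddof=1).sum()
    total_var = responses.sum(axis=1).var(ddof=1)
    return n_items / (n_items - 1) * (1.0 - item_var_sum / total_var)


def simulate_responses(n_candidates, n_items, rng):
    """Hypothetical stand-in for real exam data (assumption: Rasch-type model)."""
    ability = rng.normal(0.0, 1.0, size=(n_candidates, 1))
    difficulty = rng.normal(-1.0, 1.0, size=(1, n_items))
    p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
    return (rng.random((n_candidates, n_items)) < p_correct).astype(int)


def resample_exam(responses, test_length, cut_score, rng):
    """Draw one synthetic exam; return (alpha, agreement with full-pool decision)."""
    items = rng.choice(responses.shape[1], size=test_length, replace=False)
    short_form = responses[:, items]
    # Cut-score is expressed as a proportion of items answered correctly.
    passed_short = short_form.mean(axis=1) >= cut_score
    # Assumption: the decision on the full item pool serves as the reference.
    passed_full = responses.mean(axis=1) >= cut_score
    accuracy = float((passed_short == passed_full).mean())
    return cronbach_alpha(short_form), accuracy


rng = np.random.default_rng(0)
pool = simulate_responses(n_candidates=500, n_items=200, rng=rng)

results = {}
for test_length in (25, 50, 100):
    for cut_score in (0.5, 0.6):
        draws = [resample_exam(pool, test_length, cut_score, rng) for _ in range(20)]
        mean_alpha, mean_accuracy = np.mean(draws, axis=0)
        results[(test_length, cut_score)] = (mean_alpha, mean_accuracy)
        print(f"items={test_length:3d} cut={cut_score:.2f} "
              f"alpha={mean_alpha:.3f} accuracy={mean_accuracy:.3f}")
```

In this sketch the cut-score enters only the accuracy calculation, so only that index can respond to where the pass mark is placed, whereas alpha depends on the sampled items alone.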

Results The results indicate that only classification accuracy, not reliability, varies in relation to the cut-score for pass-fail decisions. Furthermore, reliability and classification accuracy are differently related to test length. The optimal test length when using reliability was around 100 items, independent of pass rates. For classification accuracy, recommendations are less generic. For exams with a small percentage of fail decisions (i.e., 5% or less), a test length of 50 items achieved, on average, an accuracy of 95% correct classifications.

Conclusions We suggest a move towards the use of classification accuracy, estimated with existing tools, whilst still using reliability as a complement. The benefits of re-thinking current test design practice include minimizing the burden of assessment on candidates and test developers. Item writers could focus on developing fewer, but higher quality, items. Finally, we stress the need to consider the effects of the balance of false positive and false negative decisions in pass-fail classifications.

Metadata

Item Type:	Article
Authors/Creators:	Schauber, S.K. and Homer, M. https://orcid.org/0000-0002-1161-5938
Copyright, Publisher and Additional Information:	This is an author produced version of an article published in Medical Education made available under the terms of the Creative Commons Attribution License (CC-BY), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited.
Dates:	Accepted: 13 May 2025 Published (online): 4 June 2025
Institution:	The University of Leeds
Academic Units:	The University of Leeds > Faculty of Education, Social Sciences and Law (Leeds) > School of Education (Leeds)
Depositing User:	Symplectic Publications
Date Deposited:	22 May 2025 15:31
Last Modified:	17 Jun 2025 12:59
Status:	Published online
Publisher:	Wiley
Identification Number:	10.1111/medu.15742
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:226675
