Evaluation of a Generative Language Model Tool for Writing Examination Questions

Christopher J Edwards, Brian L Erstad
DOI: https://doi.org/10.1016/j.ajpe.2024.100684
Abstract: Objective: To describe an evaluation of a generative language model tool for writing examination questions for a new elective course, focused on the interpretation of common clinical laboratory results, being developed for students in a Bachelor of Science in Pharmaceutical Sciences program. Methods: A total of 100 multiple-choice questions were generated using a publicly available large language model for a course dealing with common laboratory values. Two independent evaluators with extensive training and experience in writing multiple-choice questions evaluated each question for appropriate formatting, clarity, correctness, relevancy, and difficulty. For each question, each reviewer assigned a final dichotomous judgment: usable as written or not usable as written. Results: The major finding of this study was that a generative language model (ChatGPT 3.5) could generate multiple-choice questions for assessing common laboratory value information, but only about half of the questions (50% and 57% for the 2 evaluators) were deemed usable without modification. General agreement between evaluator comments was common (62% of comments), with more than 1 correct answer being the most common reason for commenting on the lack of usability (N = 27). Conclusion: The generally positive findings of this study suggest that the use of a generative language model tool for developing examination questions deserves further investigation.