Will code one day run a code? Performance of language models on ACEM primary examinations and implications

Jesse Smith, Philip MC Choi, Paul Buntine
DOI: https://doi.org/10.1111/1742-6723.14280
2023-07-07
EMA - Emergency Medicine Australasia
Abstract: Objective: Large language models (LLMs) have demonstrated mixed results in their ability to pass various specialist medical examinations, and their performance within the field of emergency medicine remains unknown. Methods: We explored the performance of three prevalent LLMs (OpenAI's GPT series, Google's Bard, and Microsoft's Bing Chat) on a practice ACEM primary examination. Results: All LLMs achieved a passing score, with GPT 4.0 outperforming the average candidate. Conclusion: Large language models, by passing the ACEM primary examination, show potential as tools for medical education and practice. However, limitations exist and are discussed.