Abstract:
Introduction: In the past year, large language models (LLMs) have generated significant interest and excitement because of their potential to revolutionise various fields, including medical education for aspiring physicians. Medical students undergo a demanding educational process to become competent health care professionals, and LLMs present a promising response to challenges such as information overload, time constraints and pressure on clinical educators. However, integrating LLMs into medical education raises critical concerns and challenges for educators, professionals and students. This systematic review aims to explore LLM applications in medical education, specifically their impact on medical students' learning experiences.
Methods: A systematic search of PubMed, Web of Science and Embase was performed for articles discussing the applications of LLMs in medical education, using selected keywords related to LLMs and medical education and covering the period from ChatGPT's debut until February 2024. Only articles available in full text and in English were reviewed. The credibility of each study was critically appraised by two independent reviewers.
Results: The search identified 166 studies, of which 40 were judged relevant on review. Key themes among these 40 studies included LLM capabilities, benefits such as personalised learning, and challenges regarding content accuracy. Importantly, 42.5% of these studies (17 of 40) specifically evaluated LLMs, including ChatGPT, in novel ways, in contexts such as medical exams and clinical/biomedical information, highlighting their potential to replicate human-level performance in medical knowledge. The remaining studies broadly discussed the prospective role of LLMs in medical education, reflecting a keen interest in their future potential despite current constraints.
Conclusions: The responsible implementation of LLMs in medical education offers a promising opportunity to enhance learning experiences. However, ensuring information accuracy, emphasising skill-building and maintaining ethical safeguards are crucial. Continuous critical evaluation and interdisciplinary collaboration are essential for the appropriate integration of LLMs in medical education.

Comparing Scoring Consistency of Large Language Models with Faculty for Formative Assessments in Medical Education

Leveraging large language models to construct feedback from medical multiple-choice Questions

Large Language Models for Medical OSCE Assessment: A Novel Approach to Transcript Analysis

Improving Clinical Expertise in Large Language Models Using Electronic Medical Records

Automatic Interactive Evaluation for Large Language Models with State Aware Patient Simulator

The Role of Large Language Models in Medical Education: Applications and Implications

Evaluating multiple large language models in pediatric ophthalmology

Impact of Large Language Models on Medical Education and Teaching Adaptations

Large Language Models as Partners in Student Essay Evaluation

A systematic review of large language models and their implications in medical education

Large Language Model-Driven Evaluation of Medical Records Using MedCheckLLM

A comparison of the diagnostic ability of large language models in challenging clinical cases

Large language models improve clinical decision making of medical students through patient simulation and structured feedback: a randomized controlled trial

Evaluating large language models in medical applications: a survey

Fine-Tuning Large Language Models to Enhance Programmatic Assessment in Graduate Medical Education

An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models

Benchmarking the Confidence of Large Language Models in Clinical Questions

Evaluation of Radiology Residents' Reporting Skills Using Large Language Models: An Observational Study

Harnessing the potential of large language models in medical education: promise and pitfalls

Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring

Large language models for generating medical examinations: systematic review