Are large language models more successful than humans in the Medical Specialization Examination (TUS)?

Yesim Aygul, Muge Olucoglu, Adil Alpkocak
2024-08-27
Abstract: Recent developments in natural language processing and artificial intelligence have made evident the potential of artificial intelligence in medical education and assessment. Artificial intelligence models can now answer medical questions successfully, which can help medical practitioners. This study evaluates the performance of three artificial intelligence models in answering the Turkish-language medical questions of the 2021 1st Term Medical Specialization Examination (MSE). The MSE consists of 240 questions in total across the clinical (CMST) and basic (BMST) medical sciences tests. In CMST, Gemini answered 82 questions correctly, ChatGPT-4 answered 105, and ChatGPT-4o answered 117. In BMST, Gemini and ChatGPT-4 each answered 93 questions correctly and ChatGPT-4o answered 107, according to the answer key. ChatGPT-4o outperformed the highest candidate scores of 113 in CMST and 106 in BMST. This study highlights the potential of artificial intelligence in medical education and assessment. It shows that advanced models can achieve high accuracy and contextual understanding, which supports a role for them in medical education and evaluation.
Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper evaluates the performance of three large language models (Gemini, ChatGPT-4 and ChatGPT-4o) in answering the questions of the 2021 1st Term Turkish Medical Specialization Examination (TUS). Specifically, it attempts to answer the following questions:

1. **Model accuracy**: How accurately do these large language models answer the Clinical Medical Sciences Test (KTBT, abbreviated CMST in the abstract) and the Basic Medical Sciences Test (TTBT, abbreviated BMST in the abstract)?
2. **Comparison with human performance**: Do these models perform better than the human examinees who took the same exam?
3. **Comparison between models**: How does performance differ between the models, and which models perform better on specific types of questions?

Through these questions, the paper aims to demonstrate the potential of large language models in medical education and assessment and to explore possible future applications of these models.
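The three comparisons can be illustrated with the counts reported in the abstract. The following is a minimal sketch, not the authors' evaluation code: the function and variable names are hypothetical, and overall accuracy is computed against the 240-question total because the abstract does not state how the questions are split between the two tests. It also assumes that no questions were cancelled.

```python
# Correct-answer counts as reported in the abstract
# (CMST = clinical test, BMST = basic sciences test).
CORRECT = {
    "Gemini": {"CMST": 82, "BMST": 93},
    "ChatGPT-4": {"CMST": 105, "BMST": 93},
    "ChatGPT-4o": {"CMST": 117, "BMST": 107},
}

# Highest candidate scores reported in the abstract.
TOP_CANDIDATE = {"CMST": 113, "BMST": 106}

# Whole-exam question count from the abstract. Assumption: every
# question counts, i.e. none were cancelled after the exam.
TOTAL_QUESTIONS = 240


def overall_accuracy(scores):
    """Fraction of all exam questions answered correctly."""
    return sum(scores.values()) / TOTAL_QUESTIONS


def margin_over_top_candidate(scores):
    """Per-test difference between a model and the highest candidate score."""
    return {test: scores[test] - TOP_CANDIDATE[test] for test in TOP_CANDIDATE}


def summarize():
    """Build one summary row per model (hypothetical helper)."""
    return {
        model: {
            "total": sum(scores.values()),
            "accuracy": overall_accuracy(scores),
            "margin": margin_over_top_candidate(scores),
        }
        for model, scores in CORRECT.items()
    }


if __name__ == "__main__":
    for model, row in summarize().items():
        print(f"{model}: {row['total']}/{TOTAL_QUESTIONS} "
              f"({row['accuracy']:.1%}), margin vs. top candidate: {row['margin']}")
```

Under these assumptions the totals are 175, 198 and 224 correct answers for Gemini, ChatGPT-4 and ChatGPT-4o respectively, and only ChatGPT-4o has a positive margin on both tests (+4 on CMST, +1 on BMST).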