The Pulse of Artificial Intelligence in Cardiology: A Comprehensive Evaluation of State-of-the-art Large Language Models for Potential Use in Clinical Cardiology
Andrej Novak,Ivan Zeljkovic,Fran Rode,Ante Lisicic,Iskra Alexandra Nola,Nikola Pavlovic,Sime Manola
DOI: https://doi.org/10.1101/2023.08.08.23293689
2024-12-07
Abstract:Introduction: Over the past two years, the use of Large Language Models (LLMs) in clinical medicine has expanded significantly, particularly in cardiology, where they are applied to ECG interpretation, data analysis, and risk prediction. This study evaluates the performance of five advanced LLMs-Google Bard, GPT-3.5 Turbo, GPT-4.0, GPT-4o, and GPT-o1-mini-in responding to cardiology-specific questions of varying complexity.
Methods: A comparative analysis was conducted using four test sets of increasing difficulty, encompassing a range of cardiovascular topics, from prevention strategies to acute management and diverse pathologies. The models' responses were assessed for accuracy, understanding of medical terminology, clinical relevance, and adherence to guidelines by a panel of experienced cardiologists.
Results: All models demonstrated a foundational understanding of medical terminology but varied in clinical application and accuracy. GPT-4.0 exhibited superior performance, with accuracy rates of 92% (Set A), 88% (Set B), 80% (Set C), and 84% (Set D). GPT-4o and GPT-o1-mini closely followed, surpassing GPT-3.5 Turbo, which scored 83%, 64%, 67%, and 57%, and Google Bard, which achieved 79%, 60%, 50%, and 55%, respectively. Statistical analyses confirmed significant differences in performance across the models, particularly in the more complex test sets. While all models demonstrated potential for clinical application, their inability to reference ongoing clinical trials and some inconsistencies in guideline adherence highlight areas for improvement.
Conclusion: LLMs demonstrate considerable potential in interpreting and applying clinical guidelines to vignette-based cardiology queries, with GPT-4.0 leading in accuracy and guideline alignment. These tools offer promising avenues for augmenting clinical decision-making but should be used as complementary aids under professional supervision.