Artificial Intelligence-Large Language Models (AI-LLMs) for Reliable and Accurate Cardiotocography (CTG) Interpretation in Obstetric Practice

Khanisyah Erza Gumilar, Manggala Pasca Wardhana, Muhammad Ilham Aldika Akbar, Agung Sunarko Putra, Dharma Putra Perjuangan Banjarnahor, Ryan Saktika Mulyana, Ita Fatati, Zih-Ying Yu, Yu-Cheng Hsu, Erry Gumilar Dachlan, Chien-Hsing Lu, Li-Na Liao, Ming Tan
DOI: https://doi.org/10.1101/2024.11.13.24317298
2024-11-15
Abstract:
BACKGROUND: Accurate interpretation of cardiotocography (CTG) is critical for monitoring fetal well-being during pregnancy and labor, as CTG provides crucial insights into fetal heart rate and uterine contractions. Advanced artificial intelligence (AI) tools such as AI-Large Language Models (AI-LLMs) may enhance the accuracy of CTG interpretation, leading to better clinical outcomes. However, this potential has not yet been examined or reported.
OBJECTIVE: This study aimed to evaluate the performance of three AI-LLMs (ChatGPT-4o, Gemini Advance, and Copilot) in interpreting CTG images, to compare their performance with that of junior and senior human doctors, and to assess their reliability in assisting clinical decisions.
STUDY DESIGN: Seven CTG images were interpreted by three AI-LLMs, five senior human doctors (SHD), and five junior human doctors (JHD). The interpretations were rated by five maternal-fetal medicine (MFM) experts (raters) on five parameters (relevance, clarity, depth, focus, and coherence). The raters were blinded to the source of each interpretation, and a Likert scale was used to score performance. Statistical analysis assessed the homogeneity of expert ratings and the comparative performance of AI-LLMs and doctors.
RESULTS: ChatGPT-4o (CG4o) outperformed the other AI models with a score of 77.86, much higher than Gemini Advance (57.14) and Copilot (47.29), and also higher than the junior doctors (JHD; 61.57). CG4o's score was only slightly below that of the senior doctors (SHD; 80.43), with no statistically significant difference between CG4o and SHD (p>0.05). CG4o had the highest score in the "depth" category, while on the other four parameters it was only marginally behind SHD.
CONCLUSION: CG4o demonstrated outstanding performance in CTG interpretation, surpassing junior doctors and the other AI-LLMs, while senior doctors retained the highest overall score.
AI-LLMs, particularly CG4o, showed promise as tools to assist obstetricians in clinical practice, enhance diagnostic accuracy, and improve patient care.
KEYWORDS: Cardiotocography (CTG), Artificial Intelligence Large Language Models (AI-LLMs), ChatGPT, Gemini, Copilot, Fetal monitoring, Obstetrics