Abstract: Objectives: As a type of artificial intelligence (AI), a large language model (LLM) is designed to understand and generate fluent, human-like text. Typical LLMs, e.g., GPT-3.5, GPT-4, and GPT-4o, interact with users through “prompts” and some adjustable parameters, such as “temperature.” Some AI models are already widely used in psychiatry, but systematic reports examining the capacity and suitability of LLMs for detecting psychiatric diagnoses are still lacking. In this study, we explored the performance of different generations of LLMs at different temperature levels in detecting mental illnesses from electronic medical records (EMRs). Methods: We collected 500 Chinese EMRs from one psychiatric hospital in northern Taiwan and used the “current medical history” section as the corpus. We used the GPT-3.5-turbo-16K, GPT-4, and GPT-4o models provided by Microsoft’s Azure OpenAI service (www.portal.azure.com) to generate AI-based predictions (probabilities) for the diagnoses of major depressive disorder (MDD), schizophrenia (SCZ), attention-deficit/hyperactivity disorder (ADHD), and autism spectrum disorder (ASD). Clinical diagnoses made by qualified psychiatrists were treated as the gold standard (target) in receiver operating characteristic (ROC) curve analysis. The areas under the ROC curve (AUCs) were then compared using the DeLong test. Results: Among the 500 Chinese EMRs included in this study, 56.6% had a primary diagnosis of MDD, 22.4% of SCZ, 11.2% of ADHD, and 9.2% of ASD. Overall, the LLMs achieved AUCs of 0.84 to 0.98 for detecting the four diagnoses. Differences between versions were not statistically significant, but newer versions (GPT-4o, with AUCs of 0.97–0.98 for SCZ, ADHD, and ASD) had numerically higher AUCs than older versions (GPT-3.5, with AUCs of 0.88–0.96), except for MDD (AUC of 0.95 for GPT-4 and 0.93 for GPT-4o). Although DeLong tests showed no significant differences between the AUCs of models at different temperature levels, models with a temperature of zero generally had the numerically highest AUCs. Conclusion: To the best of our knowledge, this study is the first to demonstrate that LLMs perform excellently in distinguishing some mental illnesses. Nevertheless, their diagnostic capabilities differed for other diagnoses, such as MDD. We hypothesize that this phenomenon may partially result from the complexity of symptomatology and/or the content filtering rules of OpenAI. Therefore, more advanced models, e.g., GPT-5, or privately trained models, e.g., Llama 3, combined with the relevance generative answering technique, are expected to answer our questions.
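The evaluation described in the Methods (AUCs computed against psychiatrist diagnoses, then compared with the DeLong test) can be illustrated with a minimal sketch. It is not the authors' code: the abstract does not describe their implementation, so the function names (`auc_components`, `delong_test`) and the toy scores are hypothetical. The sketch assumes binary gold-standard labels (1 = diagnosis present) and two models that each return a probability for the same records.

```python
import math
from statistics import mean


def _psi(pos_score, neg_score):
    """Pairwise kernel: 1 if the positive case outranks the negative, 0.5 on ties."""
    if pos_score > neg_score:
        return 1.0
    if pos_score == neg_score:
        return 0.5
    return 0.0


def auc_components(labels, scores):
    """Return the AUC plus the per-positive and per-negative structural components."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    v10 = [mean(_psi(p, q) for q in neg) for p in pos]  # one value per positive
    v01 = [mean(_psi(p, q) for p in pos) for q in neg]  # one value per negative
    return mean(v10), v10, v01


def _cov(a, b):
    """Sample covariance (n - 1 denominator)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def delong_test(labels, scores_a, scores_b):
    """Two-sided DeLong test for two correlated AUCs scored on the same cases.

    Needs at least two positive and two negative cases.
    """
    auc_a, v10_a, v01_a = auc_components(labels, scores_a)
    auc_b, v10_b, v01_b = auc_components(labels, scores_b)
    m, n = len(v10_a), len(v01_a)
    var = (
        (_cov(v10_a, v10_a) + _cov(v10_b, v10_b) - 2 * _cov(v10_a, v10_b)) / m
        + (_cov(v01_a, v01_a) + _cov(v01_b, v01_b) - 2 * _cov(v01_a, v01_b)) / n
    )
    if var <= 0:
        raise ValueError("variance of the AUC difference is zero; test undefined")
    z = (auc_a - auc_b) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return auc_a, auc_b, z, p_value


if __name__ == "__main__":
    # Toy data only: 1 = diagnosis confirmed by a psychiatrist, 0 = not.
    gold = [1, 1, 1, 0, 0, 0]
    model_a = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]  # hypothetical probabilities
    model_b = [0.6, 0.3, 0.7, 0.5, 0.4, 0.2]
    print(delong_test(gold, model_a, model_b))
```

With only a handful of cases the normal approximation behind the test is unreliable; the toy data exist only to show the call shape. A nonsignificant p-value alongside a visible AUC gap is the situation the Results describe when comparing model versions and temperature levels.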

Taming Large Language Models to Implement Diagnosis and Evaluating the Generation of LLMs at the Semantic Similarity Level in Acupuncture and Moxibustion

TCMChat: A Generative Large Language Model for Traditional Chinese Medicine

Relation Extraction Using Large Language Models: A Case Study on Acupuncture Point Locations

Lingdan: enhancing encoding of traditional Chinese medicine knowledge for clinical reasoning tasks with large language models

Exploring Large Language Models for Acronym, Symbol Sense Disambiguation, and Semantic Similarity and Relatedness Assessment

A Survey of Large Language Models in Medicine: Progress, Application, and Challenge

ZhongJing: A Locally Deployed Large Language Model for Traditional Chinese Medicine and Corresponding Evaluation Methodology. A large language model for data fine-tuning in the field of Traditional Chinese Medicine and a new evaluation method called TCMEval are proposed.

Exploring the Comprehension of ChatGPT in Traditional Chinese Medicine Knowledge

The application of large language models in medicine: A scoping review

Large language models in health care: Development, applications, and challenges

A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine

Evaluating Large Language Models for Radiology Natural Language Processing

Understanding natural language: Potential application of large language models to ophthalmology

Large Language Models for Disease Diagnosis: A Scoping Review

Performances of Large Language Models in Detecting Psychiatric Diagnoses from Chinese Electronic Medical Records: Comparisons between GPT-3.5, GPT-4, and GPT-4o

Development and evaluation of a large language model of ophthalmology in Chinese

Research on the model of automatic recognition and natural language question-answer system for traditional Chinese medicine tongue images based on LLMs

Evaluating LLM-Generated Multimodal Diagnosis from Medical Images and Symptom Analysis

DoctorGPT: A Large Language Model with Chinese Medical Question-Answering Capabilities

Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries

A Comprehensive Survey on Evaluating Large Language Model Applications in the Medical Industry