Abstract: Large language models (LLMs) find increasing applications in many fields. Here, three LLM chatbots (ChatGPT-3.5, ChatGPT-4 and Bard) are assessed, in their current, publicly available form, for their ability to recognize Alzheimer's Dementia (AD) and Cognitively Normal (CN) individuals using textual input derived from spontaneous speech recordings. A zero-shot learning approach is used at two levels of independent queries, with the second query (chain-of-thought prompting) eliciting more detailed output than the first. Each LLM chatbot's predictions are evaluated in terms of accuracy, sensitivity, specificity, precision and F1 score. The LLM chatbots generated a three-class outcome ("AD", "CN", or "Unsure"). When positively identifying AD, Bard produced the highest true-positive rate (89% recall) and the highest F1 score (71%), but tended to misidentify CN as AD with high confidence (low "Unsure" rates); when positively identifying CN, GPT-4 produced the highest true-negative rate (56%) and the highest F1 score (62%), adopting a diplomatic stance (moderate "Unsure" rates). Overall, the three LLM chatbots identify AD vs CN at above-chance levels but do not currently meet the requirements for clinical application.

What problem does this paper attempt to address?

The paper attempts to address the problem of evaluating the performance of large language models (LLMs) in identifying individuals with Alzheimer's disease (AD) and cognitively normal (CN) individuals. Specifically, the researchers evaluated the ability of three LLM chatbots (ChatGPT-3.5, ChatGPT-4, and Bard) to identify AD and CN individuals under zero-shot learning conditions by analyzing text inputs from spontaneous speech recordings.

### Main Issues:

1. **Identification Ability**: Can these LLM chatbots effectively distinguish between AD and CN individuals?
2. **Performance Metrics**: What are the accuracy, sensitivity, specificity, precision, and F1 scores of these chatbots?
3. **Uncertainty Handling**: Do these chatbots provide an "Unsure" classification when uncertain?
4. **Impact of Different Query Methods**: Does the type of query (e.g., chain-of-thought prompting) affect the chatbots' identification performance?

### Research Background:

- **Existing Methods**: Current methods using audio, text, or their fusion features for AD and CN classification have achieved accuracies ranging from 78.9% to 88.7%.
- **Zero-Shot Learning**: Unlike supervised learning methods, zero-shot learning does not require labeled training data, making it more flexible in practical applications but also more challenging.

### Research Methods:

- **Dataset**: The audio dataset provided by the ADReSSo Challenge, which includes recordings of 36 cognitively normal individuals and 35 Alzheimer's disease patients, was used.
- **Text Transcription**: Audio recordings were transcribed into text to serve as input for the chatbots.
- **Query Methods**:
  - **Query 1**: Directly ask whether the recording is from a cognitively normal or Alzheimer's disease patient.
  - **Query 2**: Ask the chatbot to analyze grammar, vocabulary, structure, narrative style, etc., and provide a detailed classification opinion.
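
The two query levels described above could be sketched roughly as follows. This is only an illustration: the prompt wording, the function names `build_query1` and `build_query2`, and the constants are hypothetical and are not the prompts used in the paper. Only the list of features for the second query (grammar, vocabulary, structure, narrative style) and the three outcome labels come from the text.

```python
# Hypothetical prompt builders; the paper's exact prompt wording is not known here.

# The three outcome labels named in the abstract.
LABELS = ("AD", "CN", "Unsure")

# Features the summary lists for the more detailed second query.
QUERY2_FEATURES = ("grammar", "vocabulary", "structure", "narrative style")


def build_query1(transcript: str) -> str:
    """Direct zero-shot question: CN or AD? (assumed wording)"""
    lines = [
        "The following is a transcript of a spontaneous speech recording.",
        "Is the speaker cognitively normal (CN) or a person with "
        "Alzheimer's dementia (AD)?",
        "Answer with one of: " + ", ".join(LABELS) + ".",
        "",
        "Transcript: " + transcript,
    ]
    return "\n".join(lines)


def build_query2(transcript: str) -> str:
    """More detailed query asking for feature-by-feature reasoning (assumed wording)."""
    lines = [
        "The following is a transcript of a spontaneous speech recording.",
        "Analyze its " + ", ".join(QUERY2_FEATURES) + ", step by step.",
        "Then state whether the speaker is most likely: "
        + ", ".join(LABELS) + ".",
        "",
        "Transcript: " + transcript,
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "well there's a boy on a stool reaching for the cookie jar"
    print(build_query1(sample))
    print()
    print(build_query2(sample))
```

Each transcript would be sent as an independent query, with no labeled examples included in the prompt, which is what makes the setup zero-shot.
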
### Main Findings:

- **Overall Performance**: All three chatbots performed above chance level in identifying AD and CN but have not yet reached clinical application standards.
- **Specific Performance**:
  - **Bard**: Performed best in identifying AD, with a true positive rate of 89% and an F1 score of 71%, but tended to misclassify CN as AD with high confidence (low "Unsure" rates).
  - **GPT-4**: Performed best in identifying CN, with a true negative rate of 56% and an F1 score of 62%, taking a more cautious approach with moderate rates of "Unsure" classifications.
  - **GPT-3.5**: Performed well in identifying AD but had average performance in identifying CN.

### Conclusion:

- **Potential and Limitations**: While these LLM chatbots show some potential in identifying AD and CN, their current performance is not sufficient for clinical application. Future research could explore more complex query methods and more training data to improve their identification accuracy.
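
As a rough illustration of how the reported metrics relate to a three-class outcome, the sketch below computes accuracy, sensitivity (recall), specificity, precision, and F1 for one target class at a time. The function name `binary_metrics` and the toy labels are hypothetical. It also makes one assumption the source does not confirm: any prediction other than the target class, including "Unsure", is counted as a negative prediction. The paper may handle "Unsure" differently.

```python
def binary_metrics(y_true, y_pred, positive):
    """One-vs-rest metrics for the class `positive`.

    Assumption (not from the paper): every prediction that is not
    `positive`, including "Unsure", counts as a negative prediction.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def safe_div(num, den):
        # Avoid ZeroDivisionError when a class is never predicted or present.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # sensitivity / true-positive rate
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "sensitivity": recall,
        "specificity": safe_div(tn, tn + fp),
        "precision": precision,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "unsure_rate": safe_div(sum(p == "Unsure" for _, p in pairs), len(pairs)),
    }


if __name__ == "__main__":
    # Toy labels for illustration only; these are not data from the paper.
    y_true = ["AD", "AD", "AD", "CN", "CN", "CN"]
    y_pred = ["AD", "AD", "Unsure", "AD", "CN", "Unsure"]
    for target in ("AD", "CN"):
        print(target, binary_metrics(y_true, y_pred, positive=target))
```

Evaluating once with AD as the target and once with CN as the target mirrors the way the findings report separate results for "positively identifying AD" and "positively identifying CN".
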

Performance Assessment of ChatGPT vs Bard in Detecting Alzheimer's Dementia

A - 133 The Clinical Utility of Large Language Models in Diagnosing Neurocognitive Disorders among NACC Participants

Human versus Artificial Intelligence: ChatGPT-4 Outperforming Bing, Bard, ChatGPT-3.5, and Humans in Clinical Chemistry Multiple-Choice Questions

Can ChatGPT and Bard Generate Aligned Assessment Items? A Reliability Analysis against Human Performance

[Arginine decarboxyoxidase. II. Oxidation of canavanine and homoarginine to beta-guanidoxy-propionamide and delta-guanidovaleramide].

Text dialogue analysis Based ChatGPT for Primary Screening of Mild Cognitive Impairment

Performance of ChatGPT and Bard in self-assessment questions for nephrology board renewal

Use of large language model-based chatbots in managing the rehabilitation concerns and education needs of outpatient stroke survivors and caregivers

Evaluating cognitive performance: Traditional methods vs. ChatGPT

The treatment of Sjögren's disease in NZB/NZW F1 hybrid mice with azathioprine: a two-stage study.

Alzheimer's disease recognition from spontaneous speech using large language models

Performance of Large Language Models (ChatGPT, Bing Search, and Google Bard) in Solving Case Vignettes in Physiology

The Clinical Utility of Large Language Models in Diagnosing Neurocognitive Disorders among NACC Participants

ChatGPT's Inconsistency in the Diagnosis of Alzheimer's Disease

Can LLMs like GPT-4 outperform traditional AI tools in dementia diagnosis? Maybe, but not today

Evaluation of ChatGPT Family of Models for Biomedical Reasoning and Classification

Comparative Performance of ChatGPT and Bard in a Text-Based Radiology Knowledge Assessment

ChatGPT Assisting Diagnosis of Neuro-ophthalmology Diseases Based on Case Reports

Large language models in pathology: A comparative study of ChatGPT and bard with pathology trainees on multiple-choice questions

Overlapping of the coding regions for alpha and gamma components of penicillin-binding protein 1 b in Escherichia coli.