Performance Assessment of ChatGPT vs Bard in Detecting Alzheimer's Dementia

Balamurali B T, Jer-Ming Chen
2024-01-30
Abstract: Large language models (LLMs) find increasing applications in many fields. Here, three LLM chatbots (ChatGPT-3.5, ChatGPT-4 and Bard) are assessed, in their current publicly available form, for their ability to recognize Alzheimer's Dementia (AD) and Cognitively Normal (CN) individuals using textual input derived from spontaneous speech recordings. A zero-shot learning approach is used at two levels of independent queries, with the second query (chain-of-thought prompting) eliciting more detailed output than the first. Each LLM chatbot's performance is evaluated on the predictions generated, in terms of accuracy, sensitivity, specificity, precision and F1 score. The LLM chatbots generated a three-class outcome ("AD", "CN", or "Unsure"). When positively identifying AD, Bard produced the highest true-positive rate (89% recall) and the highest F1 score (71%), but tended to misidentify CN as AD with high confidence (low "Unsure" rates); when positively identifying CN, GPT-4 produced the highest true-negative rate (56%) and the highest F1 score (62%), adopting a diplomatic stance (moderate "Unsure" rates). Overall, the three LLM chatbots identify AD vs CN at above-chance levels but do not currently meet the requirements for clinical application.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper evaluates how well large language models (LLMs) can distinguish individuals with Alzheimer's dementia (AD) from cognitively normal (CN) individuals. Specifically, the researchers tested three LLM chatbots (ChatGPT-3.5, ChatGPT-4, and Bard) under zero-shot learning conditions, using text transcribed from spontaneous speech recordings as input.

### Main Issues:

1. **Identification Ability**: Can these LLM chatbots effectively distinguish between AD and CN individuals?
2. **Performance Metrics**: What are the accuracy, sensitivity, specificity, precision, and F1 scores of these chatbots?
3. **Uncertainty Handling**: Do these chatbots provide an "Unsure" classification when uncertain?
4. **Impact of Different Query Methods**: Does the type of query (e.g., chain-of-thought prompting) affect the chatbots' identification performance?

### Research Background:

- **Existing Methods**: Current methods using audio features, text features, or a fusion of both for AD vs CN classification have achieved accuracies ranging from 78.9% to 88.7%.
- **Zero-Shot Learning**: Unlike supervised learning methods, zero-shot learning does not require labeled training data, making it more flexible in practical applications but also more challenging.

### Research Methods:

- **Dataset**: The audio dataset provided by the ADReSSo Challenge was used. It includes recordings of 36 cognitively normal individuals and 35 patients with Alzheimer's dementia.
- **Text Transcription**: Audio recordings were transcribed into text to serve as input for the chatbots.
- **Query Methods**:
  - **Query 1**: Directly ask whether the recording is from a cognitively normal individual or a patient with Alzheimer's dementia.
  - **Query 2**: Ask the chatbot to analyze grammar, vocabulary, structure, narrative style, etc., and provide a detailed classification opinion.

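The two query levels described above can be sketched as follows. This is a minimal illustration, not the authors' protocol: the prompt wording, the `Label:` answer format, and the helper names `build_query` and `parse_label` are all hypothetical, and no chatbot API is called.

```python
import re

# Hypothetical prompt wording; the paper's exact queries are not reproduced here.
BASE_PROMPT = (
    "The following is a transcript of spontaneous speech.\n"
    "Decide whether the speaker is cognitively normal (CN) or has "
    "Alzheimer's dementia (AD). If you cannot decide, answer Unsure.\n"
)
# Extra instruction used only for the second, chain-of-thought style query.
DETAIL_PROMPT = (
    "Before answering, analyze the grammar, vocabulary, structure and "
    "narrative style of the transcript step by step.\n"
)
# Assumed answer format so that the reply can be parsed mechanically.
ANSWER_FORMAT = (
    "End your reply with one line: 'Label: AD', 'Label: CN' or 'Label: Unsure'.\n"
)


def build_query(transcript: str, detailed: bool = False) -> str:
    """Build a zero-shot prompt (Query 1) or its detailed variant (Query 2)."""
    parts = [BASE_PROMPT]
    if detailed:
        parts.append(DETAIL_PROMPT)
    parts.append(ANSWER_FORMAT)
    parts.append("Transcript:\n" + transcript.strip())
    return "".join(parts)


def parse_label(reply: str) -> str:
    """Map a free-text reply to 'AD', 'CN' or 'Unsure'.

    Uses the last 'Label: ...' line if present; any reply without a
    recognizable label is treated as 'Unsure' (an assumption of this sketch).
    """
    matches = re.findall(r"Label:\s*(AD|CN|Unsure)\b", reply)
    return matches[-1] if matches else "Unsure"
```

Since the two query levels are independent, each prompt would be sent in a fresh conversation, for example `build_query(text)` for Query 1 and `build_query(text, detailed=True)` for Query 2, with `parse_label` applied to each reply.
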
### Main Findings:

- **Overall Performance**: All three chatbots performed above random levels in identifying AD and CN but have not yet reached clinical application standards.
- **Specific Performance**:
  - **Bard**: Performed best in identifying AD with a true positive rate of 89% and an F1 score of 71%, but tended to misclassify CN as AD.
  - **GPT-4**: Performed best in identifying CN with a true negative rate of 56% and an F1 score of 62%, taking a more cautious approach and providing more "Unsure" classifications.
  - **GPT-3.5**: Performed well in identifying AD but had average performance in identifying CN.

### Conclusion:

- **Potential and Limitations**: While these LLM chatbots show some potential in identifying AD and CN, their current performance is not sufficient for clinical application. Future research could explore more complex query methods and more training data to improve their identification accuracy.
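The metrics named in the summary (accuracy, sensitivity, specificity, precision, F1) can be computed from three-class predictions with a one-vs-rest count. The sketch below is illustrative only. It assumes that an "Unsure" prediction never counts as a positive call, which may differ from how the paper scores "Unsure" outputs, and the function names and toy labels are hypothetical rather than data from the study.

```python
def _safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def one_vs_rest_metrics(y_true, y_pred, positive="AD"):
    """Compute binary metrics for `positive` against everything else.

    Assumption of this sketch: an "Unsure" prediction is never a positive
    call, so it is a false negative for a positive case and a true
    negative for any other case.
    """
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return {
        "accuracy": _safe_div(tp + tn, tp + fp + fn + tn),
        "sensitivity": _safe_div(tp, tp + fn),   # recall / true-positive rate
        "specificity": _safe_div(tn, tn + fp),   # true-negative rate
        "precision": _safe_div(tp, tp + fp),
        "f1": _safe_div(2 * tp, 2 * tp + fp + fn),
    }


def unsure_rate(y_pred) -> float:
    """Fraction of predictions labelled 'Unsure'."""
    return _safe_div(sum(p == "Unsure" for p in y_pred), len(y_pred))


# Toy labels for illustration only; not results from the paper.
y_true = ["AD", "AD", "AD", "CN", "CN", "CN"]
y_pred = ["AD", "AD", "Unsure", "AD", "CN", "Unsure"]
ad_metrics = one_vs_rest_metrics(y_true, y_pred, positive="AD")
```

Calling the same function with `positive="CN"` gives the CN-centred view, mirroring the summary's separate reporting of AD and CN identification.
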