Abstract:Background: Generative artificial intelligence (AI) and large language models, such as OpenAI's ChatGPT, have shown promising potential in supporting medical education and clinical decision-making, given their vast knowledge base and natural language processing capabilities. As a general purpose AI system, ChatGPT can complete a wide range of tasks, including differential diagnosis without additional training. However, the specific application of ChatGPT in learning and applying a series of specialized, context-specific tasks mimicking the workflow of a human assessor, such as administering a standardized assessment questionnaire, followed by inputting assessment results in a standardized form, and interpretating assessment results strictly following credible, published scoring criteria, have not been thoroughly studied. Objective: This exploratory study aims to evaluate and optimize ChatGPT's capabilities in administering and interpreting the Sour Seven Questionnaire, an informant-based delirium assessment tool. Specifically, the objectives were to train ChatGPT-3.5 and ChatGPT-4 to understand and correctly apply the Sour Seven Questionnaire to clinical vignettes using prompt engineering, assess the performance of these AI models in identifying and scoring delirium symptoms against scores from human experts, and refine and enhance the models' interpretation and reporting accuracy through iterative prompt optimization. Methods: We used prompt engineering to train ChatGPT-3.5 and ChatGPT-4 models on the Sour Seven Questionnaire, a tool for assessing delirium through caregiver input. Prompt engineering is a methodology used to enhance the AI's processing of inputs by meticulously structuring the prompts to improve accuracy and consistency in outputs. In this study, prompt engineering involved creating specific, structured commands that guided the AI models in understanding and applying the assessment tool's criteria accurately to clinical vignettes. This approach also included designing prompts to explicitly instruct the AI on how to format its responses, ensuring they were consistent with clinical documentation standards. Results: Both ChatGPT models demonstrated promising proficiency in applying the Sour Seven Questionnaire to the vignettes, despite initial inconsistencies and errors. Performance notably improved through iterative prompt engineering, enhancing the models' capacity to detect delirium symptoms and assign scores. Prompt optimizations included adjusting the scoring methodology to accept only definitive "Yes" or "No" responses, revising the evaluation prompt to mandate responses in a tabular format, and guiding the models to adhere to the 2 recommended actions specified in the Sour Seven Questionnaire. Conclusions: Our findings provide preliminary evidence supporting the potential utility of AI models such as ChatGPT in administering standardized clinical assessment tools. The results highlight the significance of context-specific training and prompt engineering in harnessing the full potential of these AI models for health care applications. Despite the encouraging results, broader generalizability and further validation in real-world settings warrant additional research.

Text dialogue analysis Based ChatGPT for Primary Screening of Mild Cognitive Impairment

Text Dialogue Analysis for Primary Screening of Mild Cognitive Impairment: Development and Validation Study

Evaluating cognitive performance: Traditional methods vs. ChatGPT

GPT-4 and Neurologists in Screening for Mild Cognitive Impairment in the Elderly: A Comparative Analysis Study

Uncovering Language Disparity of ChatGPT in Healthcare: Non-English Clinical Environment for Retinal Vascular Disease Classification (Preprint)

A - 133 The Clinical Utility of Large Language Models in Diagnosing Neurocognitive Disorders among NACC Participants

The Clinical Utility of Large Language Models in Diagnosing Neurocognitive Disorders among NACC Participants

Performance Assessment of ChatGPT versus Bard in Detecting Alzheimer's Dementia

Performance Assessment of ChatGPT vs Bard in Detecting Alzheimer's Dementia

Uncovering Language Disparity of ChatGPT in Healthcare: Non-English Clinical Environment for Retinal Vascular Disease Classification

ChatGPT's Inconsistency in the Diagnosis of Alzheimer's Disease

ChatGPT Assisting Diagnosis of Neuro-ophthalmology Diseases Based on Case Reports

Design and development of the intelligent voice recognition‐based cognitive assessment WeChat mini‐program

Development and validity of computerized neuropsychological assessment devices for screening mild cognitive impairment: Ensemble of models with feature space heterogeneity and retrieval practice effect

Optimizing ChatGPT's Interpretation and Reporting of Delirium Assessment Outcomes: Exploratory Study

Design and application of a game‐based WeChat mini‐program for screening cognitive impairments in Chinese older adults

Are Different Versions of ChatGPT's Ability Comparable to the Clinical Diagnosis Presented in Case Reports? A Descriptive Study

Evaluating Chatgpt As an Adjunct for Analyzing Challenging Case

Uncovering Language Disparity of ChatGPT on Retinal Vascular Disease Classification: Cross-Sectional Study

Selecting and Analyzing Speech Features for the Screening of Mild Cognitive Impairment

Evaluating the Efficacy of AI-Based Interactive Assessments Using Large Language Models for Depression Screening