Geriatrics and artificial intelligence in Spain (Ger-IA project): talking to ChatGPT, a nationwide survey

Daniel Rosselló-Jiménez, S. Docampo, Y. Collado, L. Cuadra-Llopart, F. Riba, M. Llonch-Masriera
DOI: https://doi.org/10.1007/s41999-024-00970-7
2024-04-15
European Geriatric Medicine
Abstract: To study doctors' degree of agreement with an artificial intelligence tool (ChatGPT) that provided answers to different problems or situations in geriatric medicine. Specialists rated ChatGPT's answers lower than residents did. Answers to questions related to general or theoretical aspects obtained higher mean scores, while those related to complex clinical decisions obtained lower scores. ChatGPT could be a good tool for generating hypotheses and ordering and articulating ideas, but it is still far from being used for medical decision-making in our context. The purposes of the study were to describe the degree of agreement of geriatricians with the answers given by an AI tool (ChatGPT) in response to questions related to different areas in geriatrics, to study the differences between specialists and residents in geriatrics in terms of the degree of agreement with ChatGPT, and to analyse the mean scores obtained by areas of knowledge/domains. An observational study was conducted involving 126 doctors from 41 geriatric medicine departments in Spain. Ten questions about geriatric medicine were posed to ChatGPT, and doctors evaluated the AI's answers using a Likert scale. Sociodemographic variables were included. Questions were categorized into five knowledge domains, and means and standard deviations were calculated for each. 130 doctors answered the questionnaire. 126 doctors (69.8% women, mean age 41.4 [9.8]) were included in the final analysis. The mean score obtained by ChatGPT was 3.1/5 [0.67]. Specialists rated ChatGPT lower than residents (3.0/5 vs. 3.3/5 points, respectively, P < 0.05). By domains, ChatGPT scored better in general/theoretical questions (M: 3.96; SD: 0.71) than in complex decisions/end-of-life situations (M: 2.50; SD: 0.76), and answers related to diagnosis/performing of complementary tests obtained the lowest scores (M: 2.48; SD: 0.77). Scores varied widely depending on the area of knowledge.
Questions related to theoretical aspects of the challenges and future of geriatrics obtained better scores. When it came to complex decision-making, the appropriateness of therapeutic efforts, or decisions about diagnostic tests, professionals rated ChatGPT's performance as poorer. AI is likely to be incorporated into some areas of medicine, but it still presents important limitations, mainly in complex medical decision-making.
geriatrics & gerontology, gerontology