Emre Sezgin, Daniel I Jackson, A Baki Kocaballi, Mindy Bibart, Sue Zupanec, Wendy Landier, Anthony Audino, Mark Ranalli, Micah Skeens
Abstract: Background and Objectives: In pediatric oncology, caregivers seek detailed, accurate, and understandable information about their child's condition, treatment, and side effects. The primary aim of this study was to assess the performance of four publicly accessible large language model (LLM) supported knowledge generation and search tools in providing valuable and reliable information to caregivers of children with cancer.
Methods: This cross-sectional study evaluated the performance of four LLM-supported tools, ChatGPT (GPT-4), Google Bard (Gemini Pro), Microsoft Bing Chat, and Google SGE, against a set of frequently asked questions (FAQs) derived from the Children's Oncology Group Family Handbook and expert input. Five pediatric oncology experts assessed the generated LLM responses using measures including Accuracy (3-point ordinal scale), Clarity (3-point ordinal scale), Inclusivity (3-point ordinal scale), Completeness (dichotomous nominal scale), Clinical Utility (5-point Likert scale), and Overall Rating (4-point ordinal scale). Additional Content Quality Criteria such as Readability (ordinal scale; 5-18th grade of educated reading), Presence of AI Disclosure (dichotomous scale), Source Credibility (3-point interval scale), Resource Matching (3-point ordinal scale), and Content Originality (ratio scale) were also evaluated. We used descriptive analysis including the mean, standard deviation, median, and interquartile range. We conducted the Shapiro-Wilk test for normality, Levene's test for homogeneity of variances, and Kruskal-Wallis H-tests with Dunn's post-hoc tests for pairwise comparisons.
Results: Through expert evaluation, ChatGPT showed high performance in accuracy (M=2.71, SD=0.235), clarity (M=2.73, SD=0.271), completeness (M=0.815, SD=0.203), Clinical Utility (M=3.81, SD=0.544), and Overall Rating (M=3.13, SD=0.419). Bard also performed well, especially in accuracy (M=2.56, SD=0.400) and clarity (M=2.54, SD=0.411), while Bing Chat (Accuracy M=2.33, SD=0.456; Clarity M=2.29, SD=0.424) and Google SGE (Accuracy M=2.08, SD=0.552; Clarity M=1.95, SD=0.541) had lower overall scores. The Presence of AI Disclosure was less frequent in ChatGPT (M=0.69, SD=0.46), which affected Clarity (M=2.73, SD=0.266), whereas Bard maintained a balance between AI Disclosure (M=0.92, SD=0.27) and Clarity (M=2.54, SD=0.403). Overall, we observed significant differences between LLM tools (p < .01).
Conclusions: LLM-supported tools potentially contribute to caregivers' knowledge of pediatric oncology on related topics. Each model has unique strengths and areas for improvement, suggesting the need for careful selection and evaluation based on specific clinical contexts. Further research is needed to explore the application of these tools in other medical specialties and patient demographics to assess their broader applicability and long-term impacts, including the usability and feasibility of using LLM-supported tools with caregivers.
### Problems the paper attempts to solve
This research aims to evaluate the performance of large language model (LLM)-supported knowledge generation and search tools in providing valuable and reliable information for caregivers of children with cancer. Specifically, the main objectives of the study are:
1. **Evaluating the performance of LLM tools**: Through expert evaluation, compare the accuracy, clarity, inclusivity, completeness, clinical utility, and overall rating of four publicly available LLM tools (ChatGPT, Google Bard, Microsoft Bing Chat, Google SGE) when answering frequently asked questions (FAQs) relevant to caregivers of children with cancer.
2. **Exploring the application potential of LLM tools in pediatric oncology**: Investigate whether these tools can provide useful and reliable information for caregivers, thereby helping them better understand their children's conditions, treatment plans, and side effects.
3. **Identifying the advantages and areas for improvement of LLM tools**: Through detailed evaluation, determine the unique advantages and areas for improvement of each tool, to inform selection and optimization in specific clinical settings.
### Research background
In pediatric oncology, caregivers usually seek detailed, accurate, and easy-to-understand information about their children's conditions, treatment plans, and possible side effects. This information is crucial for caregivers' understanding and decision-making. However, traditional information sources (such as brochures provided by hospitals) may not provide personalized feedback, causing caregivers to rely on informal channels such as the Internet to obtain information. Although these channels are convenient and easily accessible, there is a risk of inaccurate or misleading information.
### Research methods
1. **Model selection**: The research team selected four publicly available and popular LLM tools, namely ChatGPT, Google Bard, Microsoft Bing Chat, and Google SGE.
2. **Question creation**: The research team created 26 frequently asked questions (FAQs) based on common questions in pediatric oncology, drawing on the Children's Oncology Group (COG) Family Handbook and expert opinions, and covering different stages such as pre-diagnosis, diagnosis, treatment, and rehabilitation.
3. **Response generation**: Each LLM tool was sequentially prompted with these 26 questions and generated corresponding answers. The generated responses were recorded for expert evaluation.
4. **Expert evaluation**: Five pediatric oncology experts evaluated the responses generated by each LLM tool. The evaluation criteria included accuracy, clarity, inclusivity, completeness, clinical utility, and overall rating. In addition, content quality indicators such as readability, AI disclosure, source credibility, resource matching, and content originality were also evaluated.
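
The abstract states that differences between tools were tested with Kruskal-Wallis H-tests. The following is a minimal sketch of that statistic in plain Python, not the authors' analysis code. The function names (`average_ranks`, `kruskal_wallis_h`) and the ratings in the example are hypothetical and chosen only for illustration.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1..N; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for two or more groups."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)

    weighted = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * weighted - 3 * (n + 1)

    # Tie correction: 1 - sum(t^3 - t) / (N^3 - N) over groups of tied values.
    correction = 1 - sum(t ** 3 - t for t in Counter(pooled).values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


# Hypothetical 3-point accuracy ratings for four tools (illustrative only,
# not data from the study).
example = {
    "tool_a": [3, 3, 2, 3, 3],
    "tool_b": [3, 2, 2, 3, 2],
    "tool_c": [2, 2, 3, 2, 1],
    "tool_d": [1, 2, 2, 1, 2],
}
print("H =", kruskal_wallis_h(list(example.values())))
```

Under the null hypothesis, H approximately follows a chi-square distribution with k − 1 degrees of freedom, where k is the number of groups (3 for four tools). In practice a library routine such as `scipy.stats.kruskal` would typically be used instead, since it also returns the p-value.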
### Results
Through expert evaluation, ChatGPT performed well in terms of accuracy (M = 2.71, SD = 0.235), clarity (M = 2.73, SD = 0.271), completeness (M = 0.815, SD = 0.203), clinical utility (M = 3.81, SD = 0.544), and overall rating (M = 3.13, SD = 0.419). Google Bard also performed well in terms of accuracy (M = 2.56, SD = 0.400) and clarity (M = 2.54, SD = 0.411), while Bing Chat and Google SGE had lower overall scores.
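
The M and SD values above are descriptive statistics; the abstract also lists the median and interquartile range. As a rough sketch of how such a summary could be computed for one tool on one measure, assuming per-question scores are available as a list (the `describe` helper and the sample scores are hypothetical, not taken from the study):

```python
import statistics


def describe(scores):
    """Return mean, sample SD, median, and IQR for a list of scores."""
    # 'inclusive' treats the data as the full population of observed scores
    # when interpolating quartiles; other conventions give slightly
    # different cut points.
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),  # sample SD (n - 1 denominator)
        "median": median,
        "iqr": q3 - q1,
    }


# Hypothetical per-question mean accuracy scores (illustrative only).
sample_scores = [2.8, 2.6, 3.0, 2.4, 2.8, 2.6]
for name, value in describe(sample_scores).items():
    print(f"{name}: {value:.3f}")
```

Whether the study used the sample or population standard deviation, and which quartile convention, is not stated in the abstract, so those choices here are assumptions.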
### Conclusion
LLM-supported tools have potential value in providing useful and reliable information for caregivers of children with cancer. Each model has its own unique advantages and areas for improvement, so careful selection and evaluation are required in specific clinical settings. Future research should further explore the application of these tools in other medical specialties and patient groups to evaluate their broader application prospects and long-term impacts.