Abstract
Background: Although uncertainties exist regarding implementation, artificial intelligence-driven generative language models (GLMs) have enormous potential in medicine. Deployment of GLMs could improve patient comprehension of clinical texts and help address low health literacy.
Objective: The goal of this study is to evaluate the potential of ChatGPT-3.5 and GPT-4 to tailor the complexity of medical information to a patient-specific input education level, which is crucial if it is to serve as a tool in addressing low health literacy.
Methods: Input templates related to 2 prevalent chronic diseases (type II diabetes and hypertension) were designed. Each clinical vignette was adjusted for hypothetical patient education levels to evaluate output personalization. To assess the success of a GLM (GPT-3.5 and GPT-4) in tailoring output writing, the readability of pre- and posttransformation outputs was quantified using the Flesch reading ease score (FKRE) and the Flesch-Kincaid grade level (FKGL).
Results: Responses (n=80) were generated using GPT-3.5 and GPT-4 across 2 clinical vignettes. For GPT-3.5, FKRE means were 57.75 (SD 4.75), 51.28 (SD 5.14), 32.28 (SD 4.52), and 28.31 (SD 5.22) for 6th grade, 8th grade, high school, and bachelor's, respectively; FKGL mean scores were 9.08 (SD 0.90), 10.27 (SD 1.06), 13.4 (SD 0.80), and 13.74 (SD 1.18). GPT-3.5 aligned with the prespecified education level only at the bachelor's degree. Conversely, GPT-4's FKRE mean scores were 74.54 (SD 2.6), 71.25 (SD 4.96), 47.61 (SD 6.13), and 13.71 (SD 5.77), with FKGL mean scores of 6.3 (SD 0.73), 6.7 (SD 1.11), 11.09 (SD 1.26), and 17.03 (SD 1.11) for the same respective education levels. GPT-4 met the target readability for all groups except the 6th-grade FKRE average. Both GLMs produced outputs with statistically significant differences (FKRE: 6th grade P<.001; 8th grade P<.001; high school P<.001; bachelor's P=.003; FKGL: 6th grade P=.001; 8th grade P<.001; high school P<.001; bachelor's P<.001) between mean FKRE and FKGL across input education levels.
Conclusions: GLMs can change the structure and readability of medical text outputs according to the input-specified education level. However, GLMs categorize input education designations into 3 broad tiers of output readability: easy (6th and 8th grade), medium (high school), and difficult (bachelor's degree). This is the first result to suggest that there are broader boundaries in the success of GLMs in output text simplification. Future research must establish how GLMs can reliably personalize medical texts to prespecified education levels to enable a broader impact on health care literacy.
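For readers unfamiliar with the two readability metrics named in the Methods, the sketch below shows how they are conventionally computed from sentence, word, and syllable counts. The two formulas are the standard published ones. Everything else is an assumption made for illustration: the function names are hypothetical, and the syllable counter is a rough vowel-group heuristic. The abstract does not state which tool the authors used, so scores from this sketch should not be expected to reproduce the reported values.

```python
import re


def count_syllables(word: str) -> int:
    """Estimate syllables by counting vowel groups (a rough heuristic, assumed here)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent ("make"), except in "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_counts(text: str) -> tuple[int, int, int]:
    """Return (sentences, words, syllables) using simple regex tokenization.

    Limitation: decimals such as "13.4" are split at the period, so numeric
    text inflates the sentence count.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(sentences: int, words: int, syllables: int) -> float:
    """Standard Flesch reading ease formula (higher means easier to read)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(sentences: int, words: int, syllables: int) -> float:
    """Standard Flesch-Kincaid grade level formula (approximate US school grade)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


if __name__ == "__main__":
    # Hypothetical sample sentence, not taken from the study's vignettes.
    sample = "Take your pill each morning. Check your blood sugar before you eat."
    counts = text_counts(sample)
    print("reading ease:", round(flesch_reading_ease(*counts), 2))
    print("grade level:", round(flesch_kincaid_grade(*counts), 2))
```

The two scores move in opposite directions: shorter sentences and fewer syllables per word raise the reading ease score and lower the grade level. This matches the pattern in the Results, where higher input education levels produce lower FKRE and higher FKGL.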

Bridging Language Gaps in Neurology Patient Education Through Large Language Models: a Comparative Analysis of ChatGPT, Gemini, and Claude

Uncovering Language Disparity of ChatGPT in Healthcare: Non-English Clinical Environment for Retinal Vascular Disease Classification (Preprint)

Investigating the capabilities of advanced large language models in generating patient instructions and patient educational material

Evaluating the strengths and weaknesses of large language models in answering neurophysiology questions

A comparison of the diagnostic ability of large language models in challenging clinical cases

Clinical Accuracy, Relevance, Clarity, and Emotional Sensitivity of Large Language Models to Surgical Patient Questions: Cross-Sectional Study

A Study of Generative Large Language Model for Medical Research and Healthcare

Evaluating large language models on medical, lay language, and self-reported descriptions of genetic conditions

ChatGPT Assisting Diagnosis of Neuro-ophthalmology Diseases Based on Case Reports

Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries

The Potential Impact of Large Language Models on Doctor-Patient Communication: A Case Study in Prostate Cancer

Testing the Limits of Language Models: A Conversational Framework for Medical AI Assessment

The Role of Large Language Models in Medical Education: Applications and Implications

Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study

Benchmarking State-of-the-Art Large Language Models for Migraine Patient Education: Performance Comparison of Responses to Common Queries

Evaluating the Performance of Different Large Language Models on Health Consultation and Patient Education in Urolithiasis

Evaluation of Generative Language Models in Personalizing Medical Information: Instrument Validation Study

Evaluation of large language models in breast cancer clinical scenarios: A comparative analysis based on ChatGPT-3.5, ChatGPT-4.0, and Claude2

The utility of ChatGPT as a generative medical translator

Evaluating multiple large language models in pediatric ophthalmology