Abstract: Background: Dermatologic patient education materials (PEMs) are often written above the national average seventh- to eighth-grade reading level. ChatGPT-3.5, GPT-4, DermGPT, and DocsGPT are large language models (LLMs) that are responsive to user prompts. Our project assesses their use in generating dermatologic PEMs at specified reading levels. Objective: This study aims to assess the ability of select LLMs to generate PEMs for common and rare dermatologic conditions at unspecified and specified reading levels. Further, the study aims to assess the preservation of meaning across such LLM-generated PEMs, as assessed by dermatology resident trainees. Methods: The Flesch-Kincaid reading level (FKRL) of current American Academy of Dermatology PEMs was evaluated for 4 common (atopic dermatitis, acne vulgaris, psoriasis, and herpes zoster) and 4 rare (epidermolysis bullosa, bullous pemphigoid, lamellar ichthyosis, and lichen planus) dermatologic conditions. We prompted ChatGPT-3.5, GPT-4, DermGPT, and DocsGPT to "Create a patient education handout about [condition] at a [FKRL]" to iteratively generate 10 PEMs per condition at unspecified, fifth-grade, and seventh-grade FKRLs, evaluated with Microsoft Word readability statistics. The preservation of meaning across LLMs was assessed by 2 dermatology resident trainees. Results: The current American Academy of Dermatology PEMs had an average (SD) FKRL of 9.35 (1.26) and 9.50 (2.3) for common and rare diseases, respectively. For common diseases, the FKRLs of LLM-produced PEMs ranged between 9.8 and 11.21 (unspecified prompt), between 4.22 and 7.43 (fifth-grade prompt), and between 5.98 and 7.28 (seventh-grade prompt). For rare diseases, the FKRLs of LLM-produced PEMs ranged between 9.85 and 11.45 (unspecified prompt), between 4.22 and 7.43 (fifth-grade prompt), and between 5.98 and 7.28 (seventh-grade prompt).
At the fifth-grade reading level, GPT-4 was better at producing PEMs for both common and rare conditions than ChatGPT-3.5 (P=.001 and P=.01, respectively), DermGPT (P<.001 and P=.03, respectively), and DocsGPT (P<.001 and P=.02, respectively). At the seventh-grade reading level, no significant difference was found among ChatGPT-3.5, GPT-4, DocsGPT, and DermGPT in producing PEMs for common conditions (all P>.05); however, for rare conditions, ChatGPT-3.5 and DocsGPT outperformed GPT-4 (P=.003 and P<.001, respectively). The preservation of meaning analysis revealed that for common conditions, DermGPT ranked the highest for overall ease of reading, patient understandability, and accuracy (14.75/15, 98%); for rare conditions, handouts generated by GPT-4 ranked the highest (14.5/15, 97%). Conclusions: GPT-4 appeared to outperform ChatGPT-3.5, DocsGPT, and DermGPT at the fifth-grade FKRL for both common and rare conditions, although both ChatGPT-3.5 and DocsGPT performed better than GPT-4 at the seventh-grade FKRL for rare conditions. LLM-produced PEMs may reliably meet seventh-grade FKRLs for select common and rare dermatologic conditions and are easy to read, understandable for patients, and mostly accurate. LLMs may play a role in enhancing health literacy and disseminating accessible, understandable PEMs in dermatology.
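
For readers unfamiliar with the metric, the Flesch-Kincaid grade level reported throughout the abstract follows the standard formula 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The Python sketch below is illustrative only: the study used Microsoft Word readability statistics, whose sentence and syllable counting rules are not described here, so the function names and the vowel-group syllable heuristic are assumptions, and scores will likely differ somewhat from Word's output.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not Word's method):
    count vowel groups and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level using the standard published coefficients."""
    # Sentences are approximated by splitting on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


if __name__ == "__main__":
    # Hypothetical handout sentences, not taken from the study.
    simple = "Wash your skin each day. Use a mild soap. Pat it dry."
    dense = (
        "Epidermolysis bullosa is a heterogeneous group of inherited "
        "mechanobullous disorders characterized by cutaneous fragility."
    )
    print(round(flesch_kincaid_grade(simple), 2))
    print(round(flesch_kincaid_grade(dense), 2))
```

Short sentences built from one-syllable words score far lower than a single long sentence of polysyllabic medical terms, which is the lever a reading-level prompt asks an LLM to pull.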

GPT-4 Improves Readability of Institutional Heart Failure Patient Education Materials

Evaluating ChatGPT platform in delivering heart failure educational material: A comparison with the leading national cardiology institutes

Enhancing Health Literacy: Evaluating the Readability of Patient Handouts Revised by ChatGPT's Large Language Model

Advancing Patient Education in Idiopathic Intracranial Hypertension: The Promise of Large Language Models

Using Large Language Models to Generate Educational Materials on Childhood Glaucoma

Large language models and bariatric surgery patient education: a comparative readability analysis of GPT-3.5, GPT-4, Bard, and online institutional resources

Improving readability and comprehension levels of otolaryngology patient education materials using ChatGPT

Assessing the Application of Large Language Models in Generating Dermatologic Patient Education Materials According to Reading Level: Qualitative Study

Readability and Accessibility of Patient-Education Materials for Heart Failure in the United States

ChatGPT as a medical education resource in cardiology: Mitigating replicability challenges and optimizing model performance

PRO-READ IR: Enhanced PROcedural Information READability for Patient-Centered Care in Interventional Radiology with Large Language Models

Prompt engineering with ChatGPT3.5 and GPT4 to improve patient education on retinal diseases

Expanding Accessibility in Cleft Care: The Role of Artificial Intelligence in Improving Literacy of Alveolar Bone Grafting Information

Not So Patient‐Friendly: Patient Education Materials in Rheumatology and Internal Medicine Fall Short of Nationally Recommended Readability Benchmarks in the United States

Large language models: a new frontier in paediatric cataract patient education

Evaluation of Generative Language Models in Personalizing Medical Information: Instrument Validation Study

Using ChatGPT to Improve Readability of Interventional Radiology Procedure Descriptions

The Use of Large Language Models to Generate Education Materials about Uveitis

How readable the online patient education materials of intensive and critical care societies: Assessment of the readability

Empowering patients: how accurate and readable are large language models in renal cancer education

The Use of Artificial Intelligence to Improve Readability of Otolaryngology Patient Education Materials