Abstract:Background: Dermatologic patient education materials (PEMs) are often written above the national average seventh- to eighth-grade reading level. ChatGPT-3.5, GPT-4, DermGPT, and DocsGPT are large language models (LLMs) that are responsive to user prompts. Our project assesses their use in generating dermatologic PEMs at specified reading levels. Objective: This study aims to assess the ability of select LLMs to generate PEMs for common and rare dermatologic conditions at unspecified and specified reading levels. Further, the study aims to assess the preservation of meaning across such LLM-generated PEMs, as assessed by dermatology resident trainees. Methods: The Flesch-Kincaid reading level (FKRL) of current American Academy of Dermatology PEMs was evaluated for 4 common (atopic dermatitis, acne vulgaris, psoriasis, and herpes zoster) and 4 rare (epidermolysis bullosa, bullous pemphigoid, lamellar ichthyosis, and lichen planus) dermatologic conditions. We prompted ChatGPT-3.5, GPT-4, DermGPT, and DocsGPT to "Create a patient education handout about [condition] at a [FKRL]" to iteratively generate 10 PEMs per condition at unspecified fifth- and seventh-grade FKRLs, evaluated with Microsoft Word readability statistics. The preservation of meaning across LLMs was assessed by 2 dermatology resident trainees. Results: The current American Academy of Dermatology PEMs had an average (SD) FKRL of 9.35 (1.26) and 9.50 (2.3) for common and rare diseases, respectively. For common diseases, the FKRLs of LLM-produced PEMs ranged between 9.8 and 11.21 (unspecified prompt), between 4.22 and 7.43 (fifth-grade prompt), and between 5.98 and 7.28 (seventh-grade prompt). For rare diseases, the FKRLs of LLM-produced PEMs ranged between 9.85 and 11.45 (unspecified prompt), between 4.22 and 7.43 (fifth-grade prompt), and between 5.98 and 7.28 (seventh-grade prompt). At the fifth-grade reading level, GPT-4 was better at producing PEMs for both common and rare conditions than ChatGPT-3.5 (P=.001 and P=.01, respectively), DermGPT (P<.001 and P=.03, respectively), and DocsGPT (P<.001 and P=.02, respectively). At the seventh-grade reading level, no significant difference was found between ChatGPT-3.5, GPT-4, DocsGPT, or DermGPT in producing PEMs for common conditions (all P>.05); however, for rare conditions, ChatGPT-3.5 and DocsGPT outperformed GPT-4 (P=.003 and P<.001, respectively). The preservation of meaning analysis revealed that for common conditions, DermGPT ranked the highest for overall ease of reading, patient understandability, and accuracy (14.75/15, 98%); for rare conditions, handouts generated by GPT-4 ranked the highest (14.5/15, 97%). Conclusions: GPT-4 appeared to outperform ChatGPT-3.5, DocsGPT, and DermGPT at the fifth-grade FKRL for both common and rare conditions, although both ChatGPT-3.5 and DocsGPT performed better than GPT-4 at the seventh-grade FKRL for rare conditions. LLM-produced PEMs may reliably meet seventh-grade FKRLs for select common and rare dermatologic conditions and are easy to read, understandable for patients, and mostly accurate. LLMs may play a role in enhancing health literacy and disseminating accessible, understandable PEMs in dermatology.

Advancing Patient Education in Idiopathic Intracranial Hypertension: The Promise of Large Language Models

Using Large Language Models to Generate Educational Materials on Childhood Glaucoma

Large language models: a new frontier in paediatric cataract patient education

Assessing the Application of Large Language Models in Generating Dermatologic Patient Education Materials According to Reading Level: Qualitative Study

Enhancing Health Literacy: Evaluating the Readability of Patient Handouts Revised by ChatGPT's Large Language Model

Investigating the capabilities of advanced large language models in generating patient instructions and patient educational material

The Use of Large Language Models to Generate Education Materials about Uveitis

Artificial Intelligence-Generated Patient Education Materials for Helicobacter pylori Infection: A Comparative Analysis

Empowering patients: how accurate and readable are large language models in renal cancer education

Prompt engineering with ChatGPT3.5 and GPT4 to improve patient education on retinal diseases

GPT-4 Improves Readability of Institutional Heart Failure Patient Education Materials

Evaluation of Generative Language Models in Personalizing Medical Information: Instrument Validation Study

Improving readability and comprehension levels of otolaryngology patient education materials using ChatGPT

ChatGPT-3.5, ChatGPT-4, Google Bard, and Microsoft Bing to Improve Health Literacy and Communication in Pediatric Populations and Beyond

Prompt matters: evaluation of large language model chatbot responses related to Peyronie's disease

PRO-READ IR:Enhanced PROcedural Information READability for Patient-Centered Care in Interventional Radiology with Large Language Models

Using artificial intelligence to generate medical literature for urology patients: a comparison of three different large language models

Prompt matters: evaluation of large language model chatbot responses related to Peyronie’s disease

Large language models and bariatric surgery patient education: a comparative readability analysis of GPT-3.5, GPT-4, Bard, and online institutional resources

The Use of Artificial Intelligence to Improve Readability of Otolaryngology Patient Education Materials

Expanding Accessibility in Cleft Care: The Role of Artificial Intelligence in Improving Literacy of Alveolar Bone Grafting Information