Crystal T. Chang, Hodan Farah, Haiwen Gui, Shawheen Justin Rezaei, Charbel Bou-Khalil, Ye-Jean Park, Akshay Swaminathan, Jesutofunmi A. Omiye, Akaash Kolluri, Akash Chaurasia, Alejandro Lozano, Alice Heiman, Allison Sihan Jia, Amit Kaushal, Angela Jia, Angelica Iacovelli, Archer Yang, Arghavan Salles, Arpita Singhal, Balasubramanian Narasimhan, Benjamin Belai, Benjamin H. Jacobson, Binglan Li, Celeste H. Poe, Chandan Sanghera, Chenming Zheng, Conor Messer, Damien Varid Kettud, Deven Pandya, Dhamanpreet Kaur, Diana Hla, Diba Dindoust, Dominik Moehrle, Duncan Ross, Ellaine Chou, Eric Lin, Fateme Nateghi Haredasht, Ge Cheng, Irena Gao, Jacob Chang, Jake Silberg, Jason A. Fries, Jiapeng Xu, Joe Jamison, John S. Tamaresis, Jonathan H. Chen, Joshua Lazaro, Juan M. Banda, Julie J. Lee, Karen Ebert Matthys, Kirsten R. Steffner, Lu Tian, Luca Pegolotti, Malathi Srinivasan, Maniragav Manimaran, Matthew Schwede, Minghe Zhang, Minh Nguyen, Mohsen Fathzadeh, Qian Zhao, Rika Bajra, Rohit Khurana, Ruhana Azam, Rush Bartlett, Sang T. Truong, Scott L. Fleming, Shriti Raj, Solveig Behr, Sonia Onyeka, Sri Muppidi, Tarek Bandali, Tiffany Y. Eulalio, Wenyuan Chen, Xuanyu Zhou, Yanan Ding, Ying Cui, Yuqi Tan, Yutong Liu, Nigam H. Shah, Roxana Daneshjou

Abstract: Background: The integration of large language models (LLMs) in healthcare offers immense opportunity to streamline healthcare tasks, but also carries risks such as inaccurate responses and the perpetuation of bias. To address these risks, we conducted a red-teaming exercise to assess LLMs in healthcare and developed a dataset of clinically relevant scenarios for future teams to use.
Methods: We convened 80 multi-disciplinary experts to evaluate the performance of popular LLMs across multiple medical scenarios. Teams composed of clinicians, medical and engineering students, and technical professionals stress-tested LLMs with real-world clinical use cases. Teams were given a framework comprising four categories to analyze for inappropriate responses: Safety, Privacy, Hallucinations, and Bias. Prompts were tested on GPT-3.5, GPT-4.0, and GPT-4.0 with Internet. Six medically trained reviewers subsequently reanalyzed the prompt-response pairs, with two reviewers for each prompt and a third to resolve discrepancies. This process allowed for the accurate identification and categorization of inappropriate or inaccurate content within the responses.
Results: There were a total of 382 unique prompts, with 1146 total responses across three iterations of ChatGPT (GPT-3.5, GPT-4.0, GPT-4.0 with Internet). 19.8% of the responses were labeled as inappropriate, with GPT-3.5 accounting for the highest percentage at 25.7%, while GPT-4.0 and GPT-4.0 with Internet performed comparably at 16.2% and 17.5% respectively. Notably, 11.8% of responses were deemed appropriate with GPT-3.5 but inappropriate in updated models, highlighting the ongoing need to evaluate evolving LLMs.
Conclusion: The red-teaming exercise underscored the benefits of interdisciplinary efforts, as this collaborative model fosters a deeper understanding of the potential limitations of LLMs in healthcare and sets a precedent for future red-teaming events in the field. Additionally, we present all prompts and outputs as a benchmark for future LLM evaluations.
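The review procedure and the reported figures can be illustrated with a minimal Python sketch. It uses only numbers stated in the abstract (382 prompts, three model variants, and the three per-model percentages). The function names `adjudicate`, `total_responses`, and `pooled_rate` are hypothetical and are not taken from the study's materials; the tie-breaking rule is an assumption about how "a third to resolve discrepancies" might be applied.

```python
from typing import Iterable, Optional

# Figures quoted in the abstract.
PROMPTS = 382
INAPPROPRIATE_PERCENT = {
    "GPT-3.5": 25.7,
    "GPT-4.0": 16.2,
    "GPT-4.0 with Internet": 17.5,
}


def adjudicate(reviewer_a: bool, reviewer_b: bool,
               third_reviewer: Optional[bool] = None) -> bool:
    """Return the final 'inappropriate' label for one prompt-response pair.

    Assumed rule: if the two primary reviewers agree, their label stands;
    otherwise the third reviewer's label decides.
    """
    if reviewer_a == reviewer_b:
        return reviewer_a
    if third_reviewer is None:
        raise ValueError("reviewers disagree; a third review is required")
    return third_reviewer


def total_responses(prompts: int, n_models: int) -> int:
    """Each prompt is sent once to every model variant."""
    return prompts * n_models


def pooled_rate(rates_percent: Iterable[float]) -> float:
    """Pooled rate across models that all received the same number of prompts.

    With equal denominators the pooled rate is the plain mean of the rates.
    """
    rates = list(rates_percent)
    return sum(rates) / len(rates)


if __name__ == "__main__":
    print(total_responses(PROMPTS, len(INAPPROPRIATE_PERCENT)))
    print(round(pooled_rate(INAPPROPRIATE_PERCENT.values()), 1))
```

Because every model received the same 382 prompts, the overall rate is the unweighted mean of the three per-model rates, which agrees with the 19.8% reported above, and 382 prompts across three models gives the 1146 responses.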

Red Teaming Large Language Models in Medicine: Real-World Insights on Model Behavior