Abstract:

Importance: With the growing use of large language models (LLMs) in education and health care settings, it is important to ensure that the information they generate is diverse and equitable, to avoid reinforcing or creating stereotypes that may influence the aspirations of upcoming generations.

Objective: To evaluate the gender representation of LLM-generated stories involving medical doctors, surgeons, and nurses and to investigate the association of varying personality and professional seniority descriptors with the gender proportions for these professions.

Design, setting, and participants: This is a cross-sectional simulation study of publicly accessible LLMs, accessed from December 2023 to January 2024. GPT-3.5-turbo and GPT-4 (OpenAI), Gemini-pro (Google), and Llama-2-70B-chat (Meta) were each prompted to generate 500 stories featuring medical doctors, surgeons, and nurses, for a total of 6000 stories. A further 43 200 prompts were submitted to the LLMs containing varying descriptors of personality (agreeableness, neuroticism, extraversion, conscientiousness, and openness) and professional seniority.

Main outcomes and measures: The primary outcome was the gender proportion (she/her vs he/him) within stories generated by LLMs about medical doctors, surgeons, and nurses, determined from the pronouns contained within the stories and analyzed using χ² tests. The pronoun proportions for each health care profession were compared with US Census data by descriptive statistics and χ² tests.

Results: In the initial 6000 prompts submitted to the LLMs, 98% of nurses were referred to by she/her pronouns. The representation of she/her for medical doctors ranged from 50% to 84%, and that for surgeons ranged from 36% to 80%. In the 43 200 additional prompts containing personality and seniority descriptors, stories of medical doctors and surgeons with higher agreeableness, openness, and conscientiousness, as well as lower neuroticism, resulted in higher she/her (reduced he/him) representation. For several LLMs, stories focusing on senior medical doctors and surgeons were less likely to be she/her than stories focusing on junior medical doctors and surgeons.

Conclusions and relevance: This cross-sectional study highlights the need for LLM developers to update their tools for equitable and diverse gender representation in essential health care roles, including medical doctors, surgeons, and nurses. As LLMs become increasingly adopted throughout health care and education, continuous monitoring of these tools is needed to ensure that they reflect a diverse workforce, capable of serving society's needs effectively.
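The outcome measure described in the abstract (labeling each story by its pronouns, then comparing the she/her share with a reference proportion using a χ² test) can be illustrated with a minimal sketch. This is not the authors' code: the pronoun sets, the majority-vote labeling rule, the function names, and the 0.5 reference share in the usage note are all assumptions made for illustration, and no real Census figure is used.

```python
import re

# Assumed pronoun sets; the abstract only names she/her and he/him.
SHE_PRONOUNS = {"she", "her", "hers", "herself"}
HE_PRONOUNS = {"he", "him", "his", "himself"}


def classify_story(story: str) -> str:
    """Label a story 'she', 'he', or 'other' by majority pronoun count.

    Majority voting is an assumed rule, not one stated in the abstract.
    """
    tokens = re.findall(r"[a-z]+", story.lower())
    she_count = sum(token in SHE_PRONOUNS for token in tokens)
    he_count = sum(token in HE_PRONOUNS for token in tokens)
    if she_count > he_count:
        return "she"
    if he_count > she_count:
        return "he"
    return "other"  # no gendered pronouns, or a tie


def chi_square_gof(observed_she: int, observed_he: int,
                   reference_she_share: float) -> float:
    """Chi-square goodness-of-fit statistic (1 degree of freedom).

    Compares observed she/he story counts with the counts expected
    under a reference she/her share strictly between 0 and 1.
    """
    total = observed_she + observed_he
    expected_she = total * reference_she_share
    expected_he = total * (1.0 - reference_she_share)
    return ((observed_she - expected_she) ** 2 / expected_she
            + (observed_he - expected_he) ** 2 / expected_he)
```

As a usage example, `chi_square_gof(80, 20, 0.5)` compares 80 she/her and 20 he/him stories against a hypothetical 50% reference share; a statistic above 3.84 would be significant at the 0.05 level for 1 degree of freedom.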

Evaluation of Bias Towards Medical Professionals in Large Language Models

Evaluating and Addressing Demographic Disparities in Medical Large Language Models: A Systematic Review

How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making?

Large Language Models in Otolaryngology Residency Admissions: A Random Sampling Analysis

Evaluating Anti-LGBTQIA+ Medical Bias in Large Language Models

Evaluating Gender, Racial, and Age Biases in Large Language Models: A Comparative Analysis of Occupational and Crime Scenarios

Bias patterns in the application of LLMs for clinical decision support: A comprehensive study

Unmasking and Quantifying Racial Bias of Large Language Models in Medical Report Generation

"You Gotta be a Doctor, Lin": An Investigation of Name-Based Bias of Large Language Models in Employment Recommendations

The Unequal Opportunities of Large Language Models: Revealing Demographic Bias through Job Recommendations

Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study

Addressing cognitive bias in medical language models

Evaluation and mitigation of cognitive biases in medical language models

Gender Representation of Health Care Professionals in Large Language Model-Generated Stories

Do Large Language Models Discriminate in Hiring Decisions on the Basis of Race, Ethnicity, and Gender?

JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models

Large language models propagate race-based medicine

Coding Inequity: Assessing GPT-4's Potential for Perpetuating Racial and Gender Biases in Healthcare

Revealing Hidden Bias in AI: Lessons from Large Language Models

Measuring Gender and Racial Biases in Large Language Models

The Silicon Ceiling: Auditing GPT's Race and Gender Biases in Hiring