Abstract:

Background: Artificial intelligence chatbots such as ChatGPT (OpenAI) have garnered excitement about their potential for delegating writing tasks ordinarily performed by humans. Many of these tasks (eg, writing recommendation letters) have social and professional ramifications, making the potential social biases in ChatGPT's underlying language model a serious concern.

Objective: Three preregistered studies used the text analysis program Linguistic Inquiry and Word Count to investigate gender bias in recommendation letters written by ChatGPT in human-use sessions (N=1400 total letters).

Methods: We conducted analyses using 22 existing Linguistic Inquiry and Word Count dictionaries, as well as 6 newly created dictionaries based on systematic reviews of gender bias in recommendation letters, to compare recommendation letters generated for the 200 most historically popular "male" and "female" names in the United States. Study 1 used 3 different letter-writing prompts intended to accentuate professional accomplishments associated with male stereotypes, female stereotypes, or neither. Study 2 examined whether lengthening each of the 3 prompts while holding the between-prompt word count constant modified the extent of bias. Study 3 examined the variability within letters generated for the same name and prompts. We hypothesized that when prompted with gender-stereotyped professional accomplishments, ChatGPT would evidence gender-based language differences replicating those found in systematic reviews of human-written recommendation letters (eg, more affiliative, social, and communal language for female names; more agentic and skill-based language for male names).

Results: Significant differences in language between letters generated for female versus male names were observed across all prompts, including the prompt hypothesized to be neutral, and across nearly all language categories tested. Historically female names received significantly more social referents (5/6, 83% of prompts), communal or doubt-raising language (4/6, 67% of prompts), personal pronouns (4/6, 67% of prompts), and clout language (5/6, 83% of prompts). Contradicting the study hypotheses, some gender differences (eg, achievement language and agentic language) were significant in both the hypothesized and nonhypothesized directions, depending on the prompt. Heteroscedasticity between male and female names was observed in multiple linguistic categories, with greater variance for historically female names than for historically male names.

Conclusions: ChatGPT reproduces many gender-based language biases that have been reliably identified in investigations of human-written reference letters, although these differences vary across prompts and language categories. Caution should be taken when using ChatGPT for tasks that have social consequences, such as reference letter writing. The methods developed in this study may be useful for ongoing bias testing among progressive generations of chatbots across a range of real-world scenarios.

Trial Registration: OSF Registries osf.io/ztv96; https://osf.io/ztv96
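The dictionary-based comparison described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' pipeline and does not use the Linguistic Inquiry and Word Count software or its dictionaries: the word lists in `TOY_DICTIONARIES`, the helper names, and the choice of Welch's t statistic are all assumptions made for illustration. The sketch scores a letter as the percentage of its words that fall in a category, then compares two groups of scores without assuming equal variances, which is relevant given the heteroscedasticity reported in the Results.

```python
import re
from statistics import mean, variance

# Hypothetical toy dictionaries for illustration only; these are NOT the
# LIWC dictionaries or the 6 dictionaries created for the study.
# A trailing "*" means "match any word that starts with this stem".
TOY_DICTIONARIES = {
    "communal": ["support*", "caring", "team*", "help*"],
    "agentic": ["lead*", "independent*", "ambitio*", "confident*"],
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def matches(word, entry):
    """Return True if word matches a dictionary entry (stem match if it ends in '*')."""
    if entry.endswith("*"):
        return word.startswith(entry[:-1])
    return word == entry


def category_percent(text, entries):
    """Percentage of tokens in text that match any entry of one category."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(any(matches(tok, e) for e in entries) for tok in tokens)
    return 100.0 * hits / len(tokens)


def welch_t(a, b):
    """Welch's t statistic for two samples, allowing unequal variances."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se


def variance_ratio(a, b):
    """Ratio of sample variances, a crude indicator of heteroscedasticity."""
    return variance(a) / variance(b)
```

In use, one would call `category_percent(letter, TOY_DICTIONARIES["communal"])` on each generated letter, collect the scores for historically female and historically male names into two lists, and pass those lists to `welch_t` and `variance_ratio`. A real analysis would add a significance test and a correction for the many categories compared, which this sketch omits.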

Bias Perpetuates Bias: ChatGPT Learns Gender Inequities in Academic Surgery Promotions

Gender Bias in Artificial Intelligence-Written Letters of Reference

What's in a Name? Experimental Evidence of Gender Bias in Recommendation Letters Generated by ChatGPT

Editor's Spotlight/Take 5: How Prominent Are Gender Bias, Racial Bias, and Score Inflation in Orthopaedic Surgery Residency Recommendation Letters? A Systematic Review

Racial and Gender Discrimination in Hand Surgery Letters of Recommendation

ChatGPT Exhibits Gender and Racial Biases in Acute Coronary Syndrome Management

Fairness in AI-Driven Oncology: Investigating Racial and Gender Biases in Large Language Models

Examining Implicit Bias Differences in Pediatric Surgical Fellowship Letters of Recommendation Using Natural Language Processing

Dear Program Director: An Evaluation of Implicit Bias in Letters of Recommendation for Neurosurgery Residency

Response to "Letter to the Editor-Exploring the Unknown: Evaluating ChatGPT's Performance in Uncovering Novel Aspects of Plastic Surgery and Identifying Areas for Future Innovation"

Gender bias in reference letters for residency and academic medicine: a systematic review

Gender bias in colorectal surgery fellowship letters of recommendation

Analysis of a cohort of 101 CDAII patients: description of 24 new molecular variants and genotype‐phenotype correlations

Gender Bias in Letters of Recommendation: Relevance to Urology Match Outcomes and Pursuit of Fellowship Training/Academic Career

Gender Differences in Letters of Recommendations and Personal Statements for Neurotology Fellowship over 10 Years: A Deep Learning Linguistic Analysis

Gender bias in postgraduate year one pharmacy letters of recommendation

Identifying Gender and Racial Bias in Pediatric Fellowship Letters of Recommendation: Do Word Choices Influence Interview Decisions?

Do Gender Differences Exist in Letters of Recommendation for Reproductive Endocrinology and Infertility Fellowship?

Computer says 'no': Exploring systemic bias in ChatGPT using an audit approach

Digital Ink and Surgical Dreams: Perceptions of Artificial Intelligence-Generated Essays in Residency Applications

Gender and culture bias in letters of recommendation for computer science and data science masters programs