Abstract: Introduction: Large language model (LLM) chatbots have many applications in medical settings. However, these tools can perpetuate racial and gender biases through their responses, worsening disparities in healthcare. Given the ongoing discussion of LLM chatbots in oncology and the widespread goal of addressing cancer disparities, this study focuses on biases propagated by LLM chatbots in oncology. Methods: Chat Generative Pre-trained Transformer (Chat GPT; OpenAI, San Francisco, CA, USA) was asked to determine which occupation a generic description, "assesses cancer patients," would correspond to for different demographics. Chat GPT, Gemini (Alphabet Inc., Mountain View, CA, USA), and Bing Chat (Microsoft Corp., Redmond, WA, USA) were prompted to recommend oncologists in the top U.S. cities, and the demographic information (race, gender) of the recommended oncologists was compared against national distributions. Chat GPT was also asked to generate a job description for oncologists with different demographic backgrounds. Finally, Chat GPT, Gemini, and Bing Chat were asked to generate hypothetical cancer patients with race, smoking, and drinking histories. Results: LLM chatbots are about two times more likely to predict Blacks and Native Americans as oncology nurses than oncologists, compared to Asians (p < 0.01 and < 0.001, respectively). They are also significantly more likely to predict females than males as oncology nurses (p < 0.001). Chat GPT's real-world oncologist recommendations overrepresent Asians almost twofold and underrepresent Blacks twofold and Hispanics sevenfold. Chatbots also generate different job descriptions depending on demographics: for underrepresented backgrounds, the descriptions include cultural competency and advocacy and exclude treatment administration. AI-generated cancer cases are not fully representative of real-world demographic distributions and encode stereotypes about substance abuse, such as a proportion of smokers about 20% greater among Hispanics than among Whites in Chat GPT breast cancer cases. Conclusion: To our knowledge, this is the first study to investigate racial and gender biases across such a diverse set of AI chatbots, and to do so within oncology. The methodology presented in this study provides a framework for targeted bias evaluation of LLMs in various fields across medicine.

Bias of AI-generated content: an examination of news produced by large language models

AI-AI Bias: Large Language Models Favor Their Own Generated Content

Hey GPT, Can You be More Racist? Analysis from Crowdsourced Attempts to Elicit Biased Content from Generative AI

Bias Similarity Across Large Language Models

Laissez-Faire Harms: Algorithmic Biases in Generative Language Models

White Men Lead, Black Women Help? Benchmarking Language Agency Social Biases in LLMs

Large Language Models Portray Socially Subordinate Groups as More Homogeneous, Consistent with a Bias Observed in Humans

Revealing Hidden Bias in AI: Lessons from Large Language Models

Measuring Gender and Racial Biases in Large Language Models

Artificial Intelligence Tools and Bias in Journalism-related Content Generation: Comparison Between Chat GPT-3.5, GPT-4 and Bing

Bias in Generative AI

Is ChatGPT More Biased Than You?

ChatGPT Exhibits Gender and Racial Biases in Acute Coronary Syndrome Management

Fairness in AI-Driven Oncology: Investigating Racial and Gender Biases in Large Language Models

From Bytes to Biases: Investigating the Cultural Self-Perception of Large Language Models

Stars, Stripes, and Silicon: Unravelling the ChatGPT's All-American, Monochrome, Cis-centric Bias

Public Perceptions of Gender Bias in Large Language Models: Cases of ChatGPT and Ernie

Gender Bias in Large Language Models across Multiple Languages

Evaluating Gender, Racial, and Age Biases in Large Language Models: A Comparative Analysis of Occupational and Crime Scenarios

Exploring the Impact of Artificial Intelligence-Mediated Communication on Bias and Information Loss in Non-academic and Academic Writing Contexts