Abstract: Social biases can manifest in language agency. While several studies have examined agency-related bias in human-written language, very little research has investigated such biases in Large Language Model (LLM)-generated content. In addition, previous works often rely on string-matching techniques to identify agentic and communal words within texts, which fall short of accurately classifying language agency. We introduce the novel Language Agency Bias Evaluation (LABE) benchmark, which comprehensively evaluates biases in LLMs by analyzing the agency levels attributed to different demographic groups in model generations. LABE leverages 5,400 template-based prompts, an accurate agency classifier, and corresponding bias metrics to test for gender, racial, and intersectional language agency biases in LLMs on 3 text generation tasks: biographies, professor reviews, and reference letters. We also contribute the Language Agency Classification (LAC) dataset, consisting of 3,724 agentic and communal sentences. Using LABE, we unveil language agency social biases in 3 recent LLMs: ChatGPT, Llama3, and Mistral. We observe that: (1) LLM generations tend to demonstrate greater gender bias than human-written texts; (2) models demonstrate remarkably higher levels of intersectional bias than of the other bias aspects: those at the intersection of gender and racial minority groups, such as Black females, are consistently described by texts with lower levels of agency, aligning with real-world social inequalities; (3) among the 3 LLMs investigated, Llama3 demonstrates the greatest overall bias; (4) prompt-based mitigation not only fails to resolve language agency bias in LLMs but frequently exacerbates biases in generated texts.

Gender Bias in LLM-generated Interview Responses

Assessing Gender Bias in LLMs: Comparing LLM Outputs with Human Perceptions and Official Statistics

Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias

Evaluating LLMs for Gender Disparities in Notable Persons

Probing Explicit and Implicit Gender Bias through LLM Conditional Text Generation

Public Perceptions of Gender Bias in Large Language Models: Cases of ChatGPT and Ernie

Gender bias and stereotypes in Large Language Models

White Men Lead, Black Women Help? Benchmarking Language Agency Social Biases in LLMs

The Silicon Ceiling: Auditing GPT's Race and Gender Biases in Hiring

Evaluating Gender Bias of LLMs in Making Morality Judgements

With a Grain of SALT: Are LLMs Fair Across Social Dimensions?

Gender Bias in Large Language Models across Multiple Languages

Hire Me or Not? Examining Language Model's Behavior with Occupation Attributes

Unveiling Gender Bias in Terms of Profession Across LLMs: Analyzing and Addressing Sociological Implications

Evaluation of Large Language Models: STEM education and Gender Stereotypes

Investigating Subtler Biases in LLMs: Ageism, Beauty, Institutional, and Nationality Bias in Generative Models

Gender Bias of LLM in Economics: An Existentialism Perspective

"Fifty Shades of Bias": Normative Ratings of Gender Bias in GPT Generated English Text

The Unequal Opportunities of Large Language Models: Revealing Demographic Bias through Job Recommendations