Abstract: Large language models (LLMs) can pass explicit social bias tests but still harbor implicit biases, similar to humans who endorse egalitarian beliefs yet exhibit subtle biases. Measuring such implicit biases can be a challenge: as LLMs become increasingly proprietary, it may not be possible to access their embeddings and apply existing bias measures; furthermore, implicit biases are primarily a concern if they affect the actual decisions that these systems make. We address both challenges by introducing two new measures of bias: LLM Implicit Bias, a prompt-based method for revealing implicit bias; and LLM Decision Bias, a strategy to detect subtle discrimination in decision-making tasks. Both measures are based on psychological research: LLM Implicit Bias adapts the Implicit Association Test, widely used to study the automatic associations between concepts held in human minds; and LLM Decision Bias operationalizes psychological results indicating that relative evaluations between two candidates, not absolute evaluations assessing each independently, are more diagnostic of implicit biases. Using these measures, we found pervasive stereotype biases mirroring those in society in 8 value-aligned models across 4 social categories (race, gender, religion, health) in 21 stereotypes (such as race and criminality, race and weapons, gender and science, age and negativity). Our prompt-based LLM Implicit Bias measure correlates with existing language model embedding-based bias methods, but better predicts downstream behaviors measured by LLM Decision Bias. These new prompt-based measures draw from psychology's long history of research into measuring stereotype biases based on purely observable behavior; they expose nuanced biases in proprietary value-aligned LLMs that appear unbiased according to standard benchmarks.

Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework

Cognitive Bias in Decision-Making with LLMs

A Multi-LLM Debiasing Framework

Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings

Breaking Bias, Building Bridges: Evaluation and Mitigation of Social Biases in LLMs via Contact Hypothesis

LIDAO: Towards Limited Interventions for Debiasing (Large) Language Models

Social Debiasing for Fair Multi-modal LLMs

Causal-Guided Active Learning for Debiasing Large Language Models

Measuring Implicit Bias in Explicitly Unbiased Large Language Models

Evaluating Nuanced Bias in Large Language Model Free Response Answers

Causal Prompting: Debiasing Large Language Model Prompting based on Front-Door Adjustment

Promoting Equality in Large Language Models: Identifying and Mitigating the Implicit Bias based on Bayesian Theory

Unboxing Occupational Bias: Grounded Debiasing of LLMs with U.S. Labor Data

Investigating Bias in LLM-Based Bias Detection: Disparities between LLMs and Human Perception

How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making?

Locating and Mitigating Gender Bias in Large Language Models

Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis

Editable Fairness: Fine-Grained Bias Mitigation in Language Models

Discovering Bias in Latent Space: An Unsupervised Debiasing Approach

Thinking Fair and Slow: On the Efficacy of Structured Prompts for Debiasing Language Models

Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs