Abstract:Large language models (LLMs) can pass explicit social bias tests but still harbor implicit biases, similar to humans who endorse egalitarian beliefs yet exhibit subtle biases. Measuring such implicit biases can be a challenge: as LLMs become increasingly proprietary, it may not be possible to access their embeddings and apply existing bias measures; furthermore, implicit biases are primarily a concern if they affect the actual decisions that these systems make. We address both challenges by introducing two new measures of bias: LLM Implicit Bias, a prompt-based method for revealing implicit bias; and LLM Decision Bias, a strategy to detect subtle discrimination in decision-making tasks. Both measures are based on psychological research: LLM Implicit Bias adapts the Implicit Association Test, widely used to study the automatic associations between concepts held in human minds; and LLM Decision Bias operationalizes psychological results indicating that relative evaluations between two candidates, not absolute evaluations assessing each independently, are more diagnostic of implicit biases. Using these measures, we found pervasive stereotype biases mirroring those in society in 8 value-aligned models across 4 social categories (race, gender, religion, health) in 21 stereotypes (such as race and criminality, race and weapons, gender and science, age and negativity). Our prompt-based LLM Implicit Bias measure correlates with existing language model embedding-based bias methods, but better predicts downstream behaviors measured by LLM Decision Bias. These new prompt-based measures draw from psychology's long history of research into measuring stereotype biases based on purely observable behavior; they expose nuanced biases in proprietary value-aligned LLMs that appear unbiased according to standard benchmarks.

CDAIL-BIAS MEASURER: A Model Ensemble Approach for Dialogue Social Bias Measurement

Towards Identifying Social Bias in Dialog Systems: Framework, Dataset, and Benchmark

Towards Identifying Social Bias in Dialog Systems: Frame, Datasets, and Benchmarks

CHBias: Bias Evaluation and Mitigation of Chinese Conversational Language Models

CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models

BiasAsker: Measuring the Bias in Conversational AI System

Data-Augmented and Retrieval-Augmented Context Enrichment in Chinese Media Bias Detection

BiasAlert: A Plug-and-play Tool for Social Bias Detection in LLMs

CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models

Towards Measuring Fairness in Speech Recognition: Casual Conversations Dataset Transcriptions

FairMonitor: A Dual-framework for Detecting Stereotypes and Biases in Large Language Models

SocialDial: A Benchmark for Socially-Aware Dialogue Systems

Social Debiasing for Fair Multi-modal LLMs

Ask LLMs Directly, "What shapes your bias?": Measuring Social Bias in Large Language Models

Can Instruction Fine-Tuned Language Models Identify Social Bias through Prompting?

This Prompt is Measuring <MASK>: Evaluating Bias Evaluation in Language Models

Measuring Implicit Bias in Explicitly Unbiased Large Language Models

Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings

Position Debiasing Fine-Tuning for Causal Perception in Long-Term Dialogue

A Bi-directional Multi-hop Inference Model for Joint Dialog Sentiment Classification and Act Recognition

CALM : A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias