Abstract: Pre-trained language models (PLMs) are known to contain harmful information, such as social biases, which may cause negative social impacts or even catastrophic results in application. Previous work on this problem mainly used black-box methods such as probing to detect and quantify social biases in PLMs by observing model outputs. As a result, previous debiasing methods mainly fine-tune or even pre-train PLMs on newly constructed anti-stereotypical datasets, which is costly. In this work, we try to unveil the mystery of social bias inside language models by introducing the concept of Social Bias Neurons. Specifically, we propose Integrated Gap Gradients (IG²) to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias. By formalizing undesirable behavior as a distributional property of language, we employ sentiment-bearing prompts to elicit classes of sensitive words (demographics) correlated with such sentiments. IG² thus attributes the uneven distribution across demographics to specific Social Bias Neurons, tracking the trail of unwanted behavior inside PLM units to achieve interpretability. Moreover, building on this interpretable technique, we propose Bias Neuron Suppression (BNS) to mitigate social biases. By studying BERT, RoBERTa, and their attributable differences from the debiased FairBERTa, IG² allows us to locate and suppress the identified neurons and thereby mitigate undesired behaviors. As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability at low cost. Note: This work contains examples that potentially implicate stereotypes, associations, and other harms that could be offensive to individuals in certain social groups.
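
The abstract names the attribution and suppression steps but does not give their formulas. The sketch below is therefore only an illustration under stated assumptions, not the authors' method: it applies standard integrated gradients (all activations scaled together from a zero baseline) to the signed probability gap between two demographic words in a toy linear-plus-softmax layer, and it "suppresses" a neuron by setting its activation to zero. The toy model and the names `gap`, `gap_grad`, `integrated_gap_gradients`, and `suppress` are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


def gap(h, W):
    """Signed probability gap between candidate word 0 and candidate word 1.

    h: neuron activations, shape (n_neurons,)
    W: toy output weights, shape (n_words, n_neurons) (an assumption, not a PLM)
    """
    p = softmax(W @ h)
    return p[0] - p[1]


def gap_grad(h, W):
    """Analytic gradient of the gap with respect to the activations h."""
    p = softmax(W @ h)
    e0 = np.zeros_like(p)
    e1 = np.zeros_like(p)
    e0[0] = 1.0
    e1[1] = 1.0
    # d p_k / d z = p_k * (e_k - p) for a softmax output p.
    dgap_dz = p[0] * (e0 - p) - p[1] * (e1 - p)
    return W.T @ dgap_dz


def integrated_gap_gradients(h, W, steps=200):
    """Per-neuron attribution of the gap via a midpoint Riemann sum.

    Assumption: standard integrated gradients along the straight path
    from the zero vector to h, which may differ from the paper's IG^2.
    """
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean([gap_grad(a * h, W) for a in alphas], axis=0)
    return h * avg_grad


def suppress(h, scores, k=1):
    """Return a copy of h with the k highest-scoring neurons set to zero.

    Assumption: suppression means zeroing the activation.
    """
    h_suppressed = h.copy()
    h_suppressed[np.argsort(scores)[-k:]] = 0.0
    return h_suppressed


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # 3 candidate words, 5 neurons
h = rng.normal(size=5)        # activations for one prompt

scores = integrated_gap_gradients(h, W)
h_suppressed = suppress(h, scores, k=1)
print("gap before:", gap(h, W))
print("attributions:", scores)
print("gap after suppressing top neuron:", gap(h_suppressed, W))
```

With a zero baseline the softmax is uniform, so the baseline gap is zero and the attributions should sum approximately to the gap itself, which is a convenient sanity check. Zeroing the top-scoring neuron is not guaranteed to shrink the gap in a nonlinear model, so the printed values are for inspection only.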

A Trip Towards Fairness: Bias and De-Biasing in Large Language Models

Bias and Fairness in Large Language Models: A Survey

Mitigating Large Language Model Bias: Automated Dataset Augmentation and Prejudice Quantification

Towards Understanding and Mitigating Social Biases in Language Models

Interpreting Bias in Large Language Models: A Feature-Based Approach

Reducing Large Language Model Bias with Emphasis on 'Restricted Industries': Automated Dataset Augmentation and Prejudice Quantification

FairPy: A Toolkit for Evaluation of Social Biases and their Mitigation in Large Language Models

Editable Fairness: Fine-Grained Bias Mitigation in Language Models

A Survey on Fairness in Large Language Models

Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models

Promoting Equality in Large Language Models: Identifying and Mitigating the Implicit Bias based on Bayesian Theory

Towards detecting unanticipated bias in Large Language Models

From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models

Exposing Bias in Online Communities through Large-Scale Language Models

Investigating Implicit Bias in Large Language Models: A Large-Scale Study of Over 50 LLMs

The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Language Models

The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models

Unveiling and Mitigating Bias in Mental Health Analysis with Large Language Models

Cognitive bias in large language models: Cautious optimism meets anti-Panglossian meliorism

From Prejudice to Parity: A New Approach to Debiasing Large Language Model Word Embeddings