Abstract: Pre-trained language models (PLMs) are known to encode harmful information, such as social biases, which can cause negative social impacts or even catastrophic results in deployment. Previous work on this problem has mainly used black-box methods such as probing to detect and quantify social biases in PLMs by observing model outputs. As a result, previous debiasing methods mainly fine-tune or even pre-train PLMs on newly constructed anti-stereotypical datasets, which is costly. In this work, we try to unveil the mystery of social bias inside language models by introducing the concept of Social Bias Neurons. Specifically, we propose Integrated Gap Gradients (IG$^2$) to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias. By formalizing undesirable behavior as a distributional property of language, we employ sentiment-bearing prompts to elicit classes of sensitive words (demographics) correlated with such sentiments. IG$^2$ thus attributes the uneven distribution across demographics to specific Social Bias Neurons, tracing unwanted behavior to individual PLM units to achieve interpretability. Building on this interpretable technique, we further propose Bias Neuron Suppression (BNS) to mitigate social biases. By studying BERT, RoBERTa, and their attributable differences from the debiased FairBERTa, IG$^2$ allows us to locate and suppress the identified neurons and thereby mitigate undesired behaviors. As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability at low cost. (Note: This work contains examples that potentially implicate stereotypes, associations, and other harms that could be offensive to individuals in certain social groups.)
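
The attribute-then-suppress idea described in the abstract can be illustrated on a toy model. The sketch below is not the paper's implementation. It assumes a single linear layer `W` mapping three hypothetical neuron activations `h` to logits for two demographic tokens, defines the "gap" as the difference in their softmax probabilities, integrates the gradient of that gap along a path that scales one neuron from 0 to its actual value (a Riemann-sum approximation with finite-difference gradients), and then zeroes the highest-scoring neuron. All names (`gap`, `gap_attribution`, `suppress`) and all numbers are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def gap(h, W):
    """Probability gap between demographic token 0 and token 1."""
    p = softmax(W @ h)
    return p[0] - p[1]

def gap_attribution(h, W, i, steps=50, eps=1e-5):
    """Approximate integrated gradient of the gap w.r.t. neuron i.

    The path scales h[i] from 0 to its actual value while the other
    neurons stay fixed; gradients use central finite differences.
    """
    total = 0.0
    for k in range(1, steps + 1):
        scaled = h.copy()
        scaled[i] = (k / steps) * h[i]
        up, down = scaled.copy(), scaled.copy()
        up[i] += eps
        down[i] -= eps
        total += (gap(up, W) - gap(down, W)) / (2 * eps)
    return h[i] * total / steps

def suppress(h, neuron_ids):
    """Return a copy of h with the selected neurons set to zero."""
    out = h.copy()
    out[list(neuron_ids)] = 0.0
    return out

# Toy setup (assumed values): rows of W are tokens, columns are neurons.
W = np.array([[2.0, 0.0, 0.1],
              [0.0, 0.0, 0.1]])
h = np.array([1.0, 1.0, 1.0])

scores = np.array([gap_attribution(h, W, i) for i in range(len(h))])
top = int(np.argmax(np.abs(scores)))   # neuron most responsible for the gap
print("attributions:", scores)
print("gap before:", gap(h, W), "gap after:", gap(suppress(h, [top]), W))
```

In this toy setup only neuron 0 feeds the two tokens unequally, so it receives essentially all of the attribution, and zeroing it closes the gap. In a real PLM the activations would come from intermediate layers and the gradients from automatic differentiation rather than finite differences.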

Mitigating Social Biases of Pre-trained Language Models via Contrastive Self-Debiasing with Double Data Augmentation

Prompt Tuning Pushes Farther, Contrastive Learning Pulls Closer: A Two-Stage Approach to Mitigate Social Biases

Social Debiasing for Fair Multi-modal LLMs

An Empirical Analysis of Parameter-Efficient Methods for Debiasing Pre-Trained Language Models

Towards Understanding Task-agnostic Debiasing Through the Lenses of Intrinsic Bias and Forgetfulness

Sustainable Modular Debiasing of Language Models

Co$^2$PT: Mitigating Bias in Pre-trained Language Models through Counterfactual Contrastive Prompt Tuning

The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models

ChatGPT Based Data Augmentation for Improved Parameter-Efficient Debiasing of LLMs

The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Language Models

Detecting Bias in Large Language Models: Fine-tuned KcBERT

Mitigating Large Language Model Bias: Automated Dataset Augmentation and Prejudice Quantification

General Phrase Debiaser: Debiasing Masked Language Models at a Multi-Token Level

MABEL: Attenuating Gender Bias using Textual Entailment Data

A Contrastive Learning Approach to Mitigate Bias in Speech Models

Mitigating Social Biases in Language Models through Unlearning

Towards Understanding and Mitigating Social Biases in Language Models

CHBias: Bias Evaluation and Mitigation of Chinese Conversational Language Models

Do the Right Thing, Just Debias! Multi-Category Bias Mitigation Using LLMs

Debias your Large Multi-Modal Model at Test-Time with Non-Contrastive Visual Attribute Steering

Projective Methods for Mitigating Gender Bias in Pre-trained Language Models