Abstract: Large language models (LLMs) are pre-trained on extensive corpora to learn facts and aspects of human cognition, including human preferences. However, this process can inadvertently lead the models to acquire biases and stereotypes prevalent in society. Prior research has typically approached bias from a one-dimensional perspective, concentrating on either locating it or mitigating it. This narrow focus has made it difficult for the two lines of research to complement and build on each other. In this study, we integrate the processes of locating and mitigating bias within a unified framework. First, we use causal mediation analysis to trace the causal effects of the activations of different components within a large language model. Building on this, we propose LSDM (Least Square Debias Method), a knowledge-editing-based method for mitigating gender bias in occupational pronouns, and compare it against two baselines on three gender bias datasets and seven knowledge competency test datasets. The experimental results indicate that the primary contributors to gender bias are the bottom MLP modules acting on the last token of occupational pronouns and the top attention module acting on the final word in the sentence. Furthermore, LSDM mitigates gender bias in the model more effectively than the other baselines, while fully preserving the model's capabilities in all other aspects.
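
The abstract does not spell out the tracing procedure, so the following is only a minimal sketch of the general idea behind causal mediation analysis, assuming a toy two-layer network in place of an LLM. The network, its weights, and the function names (hidden, output, indirect_effects) are hypothetical and chosen for illustration; they are not the authors' implementation. The sketch runs the model on a clean and a corrupted input, then restores one hidden activation at a time to its clean value and records how much the output moves.

```python
import math

def hidden(x, W1):
    # Hidden activations (the candidate mediators): tanh of a linear map.
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]

def output(h, w2):
    # Scalar output: a linear readout of the hidden activations.
    return sum(w * hi for w, hi in zip(w2, h))

def indirect_effects(x_clean, x_corrupt, W1, w2):
    # For each hidden unit, patch its clean activation into the corrupted
    # run and measure the change in output relative to the corrupted run.
    h_clean = hidden(x_clean, W1)
    h_corrupt = hidden(x_corrupt, W1)
    y_corrupt = output(h_corrupt, w2)
    effects = []
    for i in range(len(h_clean)):
        patched = list(h_corrupt)
        patched[i] = h_clean[i]  # restore only unit i
        effects.append(output(patched, w2) - y_corrupt)
    return effects

# Hypothetical toy weights: 2 inputs, 3 hidden units, 1 output.
W1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w2 = [2.0, 0.0, 1.0]
x_clean, x_corrupt = [1.0, 0.0], [0.0, 0.0]

effects = indirect_effects(x_clean, x_corrupt, W1, w2)
total = output(hidden(x_clean, W1), w2) - output(hidden(x_corrupt, W1), w2)
for i, e in enumerate(effects):
    print(f"hidden unit {i}: indirect effect = {e:.4f}")
print(f"total effect = {total:.4f}")
```

In this toy setting the readout is linear, so the per-unit effects add up to the total effect; in a real transformer the components interact, and the same patching would be applied per layer and per token position rather than per hidden unit.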

Debiasing Algorithm through Model Adaptation

Sustainable Modular Debiasing of Language Models

Locating and Mitigating Gender Bias in Large Language Models

Language Models Get a Gender Makeover: Mitigating Gender Bias with Few-Shot Data Interventions

Parameter-efficient Modularised Bias Mitigation via AdapterFusion

Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models

Interpreting Bias in Large Language Models: A Feature-Based Approach

Does Debiasing Inevitably Degrade the Model Performance

Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models

Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model

The Birth of Bias: A case study on the evolution of gender bias in an English language model

Social Debiasing for Fair Multi-modal LLMs

Mitigating Large Language Model Bias: Automated Dataset Augmentation and Prejudice Quantification

Reducing Gender Bias in Abusive Language Detection

Evaluating Gender Bias Transfer between Pre-trained and Prompt-Adapted Language Models

Editable Fairness: Fine-Grained Bias Mitigation in Language Models

An Empirical Analysis of Parameter-Efficient Methods for Debiasing Pre-Trained Language Models

BiasDPO: Mitigating Bias in Language Models through Direct Preference Optimization

A Trip Towards Fairness: Bias and De-Biasing in Large Language Models

REFINE-LM: Mitigating Language Model Stereotypes via Reinforcement Learning

Large Language Model Bias Mitigation from the Perspective of Knowledge Editing