A Survey on Fairness in Large Language Models

Yingji Li, Mengnan Du, Rui Song, Xin Wang, Ying Wang
2024-02-21
Abstract: Large Language Models (LLMs) have shown powerful performance and development prospects and are widely deployed in the real world. However, LLMs can capture social biases from unprocessed training data and propagate the biases to downstream tasks. Unfair LLM systems have undesirable social impacts and potential harms. In this paper, we provide a comprehensive review of related research on fairness in LLMs. Considering the influence of parameter magnitude and training paradigm on research strategy, we divide existing fairness research into work oriented to medium-sized LLMs under pre-training and fine-tuning paradigms and work oriented to large-sized LLMs under prompting paradigms. First, for medium-sized LLMs, we introduce evaluation metrics and debiasing methods from the perspectives of intrinsic bias and extrinsic bias, respectively. Then, for large-sized LLMs, we introduce recent fairness research, including fairness evaluation, reasons for bias, and debiasing methods. Finally, we discuss and provide insights into the challenges and future directions for the development of fairness in LLMs.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses fairness issues in large language models (LLMs). Specifically, it focuses on how to evaluate and mitigate the social biases these models exhibit when processing natural language. Social biases can originate from unprocessed training data and be transmitted to downstream tasks through the models' embedded representations, leading to unfair results. For example, a model may make discriminatory or stereotypical judgments about particular genders, races, or other social groups.

### Main problems in the paper:

1. **Propagation of social biases**: Social biases that LLMs capture from unprocessed training data propagate to downstream tasks and affect the fairness of the resulting decisions.
2. **Unfair social impacts**: Biased language models can produce discriminatory, stereotypical, and demeaning decisions against vulnerable or marginalized groups, causing adverse social impacts and potential harm.
3. **Challenges of different training paradigms**: As model parameter counts grow and new training paradigms (such as the prompting paradigm) emerge, researchers face different fairness challenges. The paper therefore divides existing fairness research into work on medium-sized language models under the pre-training and fine-tuning paradigms and work on large-sized language models under the prompting paradigm.

### Solutions in the paper:

- **Evaluation metrics**: The paper introduces various metrics for evaluating intrinsic and extrinsic bias, including similarity-based and probability-based metrics.
- **Debiasing methods**: For medium-sized language models, the paper discusses intrinsic and extrinsic debiasing methods; for large-sized language models, it covers fairness evaluation, the causes of bias, and debiasing methods under the prompting paradigm.
- **Future directions**: The paper also discusses current challenges and future research directions, offering insights for subsequent research.

### Formula explanations:

The paper quantifies bias as a gap in an evaluation metric between samples that represent different demographic groups; a model is considered biased when that gap exceeds a fairness threshold:

- **Intrinsic bias**: \[ |E_i(z) - E_i(z')| > \epsilon_i \] where \( E_i(\cdot) \) is the intrinsic bias evaluation metric, \( z = M(x) \) and \( z' = M(x') \) are the embedded representations produced by the model \( M \) for samples \( x \) and \( x' \), which represent different demographic groups, and \( \epsilon_i \) is the desired fairness threshold.
- **Extrinsic bias**: \[ |E_e(y) - E_e(y')| > \epsilon_e \] where \( E_e(\cdot) \) is the extrinsic bias evaluation metric, \( y = C(M'(x)) \) and \( y' = C(M'(x')) \) are the predictions of the fine-tuned model \( M' \), passed through the downstream classifier \( C \), for samples \( x \) and \( x' \), and \( \epsilon_e \) is the desired fairness threshold.

In conclusion, this paper comprehensively reviews and analyzes fairness issues in LLMs, surveys methods for evaluating and mitigating bias, and points out directions for future fairness research.
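The intrinsic and extrinsic bias conditions share one shape: apply a metric to a pair of inputs from different demographic groups, then compare the absolute gap with a threshold. The Python sketch below shows that shape under stated assumptions. It is not the survey's code. The encoder, classifier, attribute vector, and every number are hypothetical toy placeholders. The intrinsic metric \( E_i \) is instantiated as cosine similarity to an attribute direction, chosen only as one example of a similarity-based metric. The extrinsic metric \( E_e \) is simply the positive-class probability. Both checks are applied to a single counterfactual pair, which is also an assumption.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def exceeds_threshold(score: float, score_prime: float, epsilon: float) -> bool:
    """Shared bias condition: |E(.) - E(.')| > epsilon (strict inequality)."""
    return abs(score - score_prime) > epsilon


def intrinsic_bias(encode: Callable[[str], Vector], metric: Callable[[Vector], float],
                   x: str, x_prime: str, epsilon_i: float) -> bool:
    """Intrinsic check on embeddings z = M(x) and z' = M(x')."""
    z, z_prime = encode(x), encode(x_prime)
    return exceeds_threshold(metric(z), metric(z_prime), epsilon_i)


def extrinsic_bias(predict: Callable[[str], float], metric: Callable[[float], float],
                   x: str, x_prime: str, epsilon_e: float) -> bool:
    """Extrinsic check on predictions y = C(M'(x)) and y' = C(M'(x'))."""
    y, y_prime = predict(x), predict(x_prime)
    return exceeds_threshold(metric(y), metric(y_prime), epsilon_e)


# --- Hypothetical toy stand-ins (not real model outputs) ---
# Lookup table playing the role of the encoder M.
TOY_EMBEDDINGS = {
    "he is a nurse": [0.9, 0.1, 0.2],
    "she is a nurse": [0.2, 0.8, 0.3],
}
# Hypothetical attribute direction, e.g. a "caregiving" concept vector.
CAREGIVING = [0.1, 0.9, 0.1]
# Lookup table playing the role of C(M'(x)): probability of the positive label.
TOY_PREDICTIONS = {
    "he is a nurse": 0.62,
    "she is a nurse": 0.71,
}


def similarity_to_caregiving(z: Vector) -> float:
    """Example similarity-based E_i: cosine similarity to the attribute vector."""
    return cosine_similarity(z, CAREGIVING)


def positive_probability(y: float) -> float:
    """Example E_e: use the positive-class probability directly."""
    return y


print(intrinsic_bias(TOY_EMBEDDINGS.__getitem__, similarity_to_caregiving,
                     "he is a nurse", "she is a nurse", epsilon_i=0.1))  # → True
print(extrinsic_bias(TOY_PREDICTIONS.__getitem__, positive_probability,
                     "he is a nurse", "she is a nurse", epsilon_e=0.05))  # → True
```

Both checks reduce to the same `exceeds_threshold` comparison. Trying a different metric, for example a probability-based score, therefore only means passing a different `metric` function. The threshold arguments `epsilon_i` and `epsilon_e` play the role of \( \epsilon_i \) and \( \epsilon_e \).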