Abstract:Large language models (LLMs) have shown remarkable advances in language generation and understanding but are also prone to exhibiting harmful social biases. While recognition of these behaviors has generated an abundance of bias mitigation techniques, most require modifications to the training data, model parameters, or decoding strategy, which may be infeasible without access to a trainable model. In this work, we leverage the zero-shot capabilities of LLMs to reduce stereotyping in a technique we introduce as zero-shot self-debiasing. With two approaches, self-debiasing via explanation and self-debiasing via reprompting, we show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups while relying only on the LLM itself and a simple prompt, with explanations correctly identifying invalid assumptions and reprompting delivering the greatest reductions in bias. We hope this work opens inquiry into other zero-shot techniques for bias mitigation.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is that large - language models (LLMs) exhibit harmful social biases when generating and understanding languages. Although these models have made remarkable progress in language processing, they are also prone to generating and magnifying derogatory, stereotypical, and exclusive social behaviors. Specifically, the paper focuses on how to reduce stereotypes in models through zero - shot self - debiasing without the need to modify training data, model parameters, or decoding strategies. ### Main contributions of the paper: 1. **Introduction of zero - shot self - debiasing techniques**: The authors propose two simple self - debiasing methods - self - debiasing via explanation and self - debiasing via reprompting. These two methods rely only on the language model itself and simple prompts. 2. **Verification of debiasing effects**: Through experiments in nine different social groups, it is shown that zero - shot self - debiasing techniques can significantly reduce stereotypes in model answers. ### Specific methods: - **Self - debiasing via explanation**: First, the model is required to explain the invalid assumptions on which the answer choice depends, and then the model is required to answer the question in the same conversation context. This method reduces bias by making the model identify potential stereotypes. - **Self - debiasing via reprompting**: First, the model is required to answer the question according to the baseline method. Then, after generating the answer, the model is prompted again to remove bias and re - answer the question. This method aims to make the model correct the initially possible stereotypes and maintain the consistency of the initial correct answer. ### Experimental results: - **Baseline performance**: Without using self - debiasing techniques, the model shows different degrees of bias for all social groups. - **Debiasing effects**: Through self - debiasing via explanation and self - debiasing via reprompting, the model's bias scores are significantly reduced. Especially in groups such as age, appearance, and socioeconomic status, the reduction in bias is most obvious. ### Conclusion: The paper shows through simple prompts that zero - shot self - debiasing techniques can significantly and consistently reduce stereotypes in large - language models. The authors hope that this work can encourage further exploration of zero - shot debiasing techniques in different tasks, models, and settings. ### Limitations: - **Limited evaluation scope**: The research mainly focuses on multiple - choice questions and cannot be directly extended to open - ended answering scenarios. - **Limitations of manual prompts**: Although the current prompts are general - purpose, manually designed prompts may be difficult to extend to other types of biases, such as exclusion norms or misrepresentation. ### Ethical considerations: The authors emphasize that technical solutions cannot completely replace broader actions against unequal power systems. They point out that zero - shot self - debiasing techniques should not be regarded as the only means to prevent representational harm, especially without further evaluation of their behavior in practical applications.

Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes

Social Debiasing for Fair Multi-modal LLMs

Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings

A Multi-LLM Debiasing Framework

"Im not Racist but...": Discovering Bias in the Internal Knowledge of Large Language Models

Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection

Debiasing Multimodal Large Language Models

REFINE-LM: Mitigating Language Model Stereotypes via Reinforcement Learning

Uncovering Biases with Reflective Large Language Models

AXOLOTL: Fairness through Assisted Self-Debiasing of Large Language Model Outputs

Breaking Bias, Building Bridges: Evaluation and Mitigation of Social Biases in LLMs via Contact Hypothesis

Bias and Fairness in Large Language Models: A Survey

Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models

Towards Understanding and Mitigating Social Biases in Language Models

VersusDebias: Universal Zero-Shot Debiasing for Text-to-Image Models via SLM-Based Prompt Engineering and Generative Adversary

Do Multilingual Large Language Models Mitigate Stereotype Bias?

Leveraging Biases in Large Language Models: "bias-kNN'' for Effective Few-Shot Learning

Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models

Large Language Models Portray Socially Subordinate Groups as More Homogeneous, Consistent with a Bias Observed in Humans