Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes

Isabel O. Gallegos,Ryan A. Rossi,Joe Barrow,Md Mehrab Tanjim,Tong Yu,Hanieh Deilamsalehy,Ruiyi Zhang,Sungchul Kim,Franck Dernoncourt
2024-02-03
Abstract:Large language models (LLMs) have shown remarkable advances in language generation and understanding but are also prone to exhibiting harmful social biases. While recognition of these behaviors has generated an abundance of bias mitigation techniques, most require modifications to the training data, model parameters, or decoding strategy, which may be infeasible without access to a trainable model. In this work, we leverage the zero-shot capabilities of LLMs to reduce stereotyping in a technique we introduce as zero-shot self-debiasing. With two approaches, self-debiasing via explanation and self-debiasing via reprompting, we show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups while relying only on the LLM itself and a simple prompt, with explanations correctly identifying invalid assumptions and reprompting delivering the greatest reductions in bias. We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
Computation and Language,Artificial Intelligence,Computers and Society,Machine Learning
What problem does this paper attempt to address?
The problem that this paper attempts to solve is that large - language models (LLMs) exhibit harmful social biases when generating and understanding languages. Although these models have made remarkable progress in language processing, they are also prone to generating and magnifying derogatory, stereotypical, and exclusive social behaviors. Specifically, the paper focuses on how to reduce stereotypes in models through zero - shot self - debiasing without the need to modify training data, model parameters, or decoding strategies. ### Main contributions of the paper: 1. **Introduction of zero - shot self - debiasing techniques**: The authors propose two simple self - debiasing methods - self - debiasing via explanation and self - debiasing via reprompting. These two methods rely only on the language model itself and simple prompts. 2. **Verification of debiasing effects**: Through experiments in nine different social groups, it is shown that zero - shot self - debiasing techniques can significantly reduce stereotypes in model answers. ### Specific methods: - **Self - debiasing via explanation**: First, the model is required to explain the invalid assumptions on which the answer choice depends, and then the model is required to answer the question in the same conversation context. This method reduces bias by making the model identify potential stereotypes. - **Self - debiasing via reprompting**: First, the model is required to answer the question according to the baseline method. Then, after generating the answer, the model is prompted again to remove bias and re - answer the question. This method aims to make the model correct the initially possible stereotypes and maintain the consistency of the initial correct answer. ### Experimental results: - **Baseline performance**: Without using self - debiasing techniques, the model shows different degrees of bias for all social groups. - **Debiasing effects**: Through self - debiasing via explanation and self - debiasing via reprompting, the model's bias scores are significantly reduced. Especially in groups such as age, appearance, and socioeconomic status, the reduction in bias is most obvious. ### Conclusion: The paper shows through simple prompts that zero - shot self - debiasing techniques can significantly and consistently reduce stereotypes in large - language models. The authors hope that this work can encourage further exploration of zero - shot debiasing techniques in different tasks, models, and settings. ### Limitations: - **Limited evaluation scope**: The research mainly focuses on multiple - choice questions and cannot be directly extended to open - ended answering scenarios. - **Limitations of manual prompts**: Although the current prompts are general - purpose, manually designed prompts may be difficult to extend to other types of biases, such as exclusion norms or misrepresentation. ### Ethical considerations: The authors emphasize that technical solutions cannot completely replace broader actions against unequal power systems. They point out that zero - shot self - debiasing techniques should not be regarded as the only means to prevent representational harm, especially without further evaluation of their behavior in practical applications.