MBIAS: Mitigating Bias in Large Language Models While Retaining Context

Shaina Raza, Ananya Raval, Veronica Chatrath
2024-06-29
Abstract: The deployment of Large Language Models (LLMs) in diverse applications necessitates an assurance of safety without compromising the contextual integrity of the generated content. Traditional approaches, including safety-specific fine-tuning or adversarial testing, often yield safe outputs at the expense of contextual meaning. This can result in a diminished capacity to handle nuanced aspects of bias and toxicity, such as underrepresentation or negative portrayals across various demographics. To address these challenges, we introduce MBIAS, an LLM framework carefully instruction fine-tuned on a custom dataset designed specifically for safety interventions. MBIAS is designed to significantly reduce biases and toxic elements in LLM outputs while preserving the main information. This work also details our further use of LLMs: as annotator under human supervision and as evaluator of generated content. Empirical analysis reveals that MBIAS achieves a reduction in bias and toxicity by over 30% in standard evaluations, and by more than 90% in diverse demographic tests, highlighting the robustness of our approach. We make the dataset and the fine-tuned model available to the research community for further investigation and ensure reproducibility. The code for this project can be accessed at https://github.com/shainarazavi/MBIAS/tree/main. Warning: This paper contains examples that may be offensive or upsetting.
Computation and Language
What problem does this paper attempt to address?
The paper asks how large language models (LLMs) can generate safe content without losing contextual integrity. Conventional safety strategies, such as safety-specific fine-tuning or adversarial testing, can sacrifice contextual meaning, which weakens a model's ability to handle nuanced forms of bias and toxicity. To address this, the paper proposes MBIAS, an LLM that is instruction fine-tuned on a custom dataset built specifically for safety interventions. The dataset consists of paired unsafe and safe texts, from which the model learns to recognize biased content and to generate an unbiased version that preserves the main information. The paper also examines the use of LLMs as annotators under human supervision and as evaluators of generated content. Experimental results show that MBIAS reduces bias and toxicity by over 30% in standard evaluations and by more than 90% in tests across diverse demographics, which the authors present as evidence of the method's robustness. The authors release the dataset and the fine-tuned MBIAS model to support further research, and they emphasize ethical considerations when modifying user-generated content, aiming for a fair text generator that respects copyright. In short, the paper aims to reduce bias and toxicity in language model outputs through an improved training method while keeping contextual integrity intact.
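To make the idea of unsafe/safe text pairs concrete, the following is a minimal sketch of how such pairs might be turned into instruction-tuning records, and how a relative reduction in a bias or toxicity score could be computed. The names `SafetyPair`, `INSTRUCTION`, `to_instruction_record`, and `percent_reduction` are hypothetical, as is the prompt wording; none of them come from the paper or its repository, and the paper's actual prompt format and metric definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class SafetyPair:
    """One training example (hypothetical schema, not the paper's)."""
    unsafe_text: str  # original text containing bias or toxicity
    safe_text: str    # rewrite that keeps the main information


# Assumed instruction wording, for illustration only.
INSTRUCTION = (
    "Rewrite the following text to remove bias and toxicity "
    "while preserving its main information."
)


def to_instruction_record(pair: SafetyPair) -> dict:
    """Turn one unsafe/safe pair into a prompt/response training record."""
    prompt = f"{INSTRUCTION}\n\nText: {pair.unsafe_text}\n\nRewritten text:"
    return {"prompt": prompt, "response": pair.safe_text}


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction of a score (e.g. mean toxicity), in percent.

    Assumes the reported reductions are relative to the baseline score,
    which is one plausible reading, not a confirmed definition.
    """
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before
```

Each record produced by `to_instruction_record` pairs a prompt that contains the unsafe text with a target response that contains the safe rewrite, which is the general shape most instruction fine-tuning pipelines expect.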