LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models

Hayder Elesedy, Pedro M. Esperança, Silviu Vlad Oprea, Mete Ozay
2024-07-03
Abstract: Guardrails have emerged as an alternative to safety alignment for content moderation of large language models (LLMs). Existing model-based guardrails have not been designed for resource-constrained computational portable devices, such as mobile phones, more and more of which are running LLM-based applications locally. We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on knowledge sharing between LLMs and guardrail models. LoRA-Guard extracts language features from the LLMs and adapts them for the content moderation task using low-rank adapters, while a dual-path design prevents any performance degradation on the generative task. We show that LoRA-Guard outperforms existing approaches with 100-1000x lower parameter overhead while maintaining accuracy, enabling on-device content moderation.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper aims to address the issue of harmful content that large language models (LLMs) may generate when producing text. Specifically, due to the potential inclusion of undesirable content in the pre-training dataset, LLMs might generate offensive language or illegal suggestions in their responses, which contradicts safety requirements. To tackle this problem, the paper proposes a method called LoRA-Guard.

### Main Objectives:

1. **Enhance content moderation capabilities on resource-constrained devices**: Existing model-based content moderation methods are challenging to deploy on platforms with limited computational resources, such as mobile devices. LoRA-Guard reduces model overhead in a parameter-efficient manner, enabling content moderation on resource-constrained devices.
2. **Prevent performance degradation**: LoRA-Guard employs a dual-path design to avoid the performance degradation in generation tasks that traditional fine-tuning methods might cause.
3. **Efficient parameter sharing**: LoRA-Guard utilizes LoRA adapters to achieve knowledge sharing with the base chat model, significantly reducing the number of additional parameters introduced.

### Experimental Results:

- LoRA-Guard outperforms existing methods on the ToxicChat dataset, with parameter overhead reduced by 100-1000 times.
- On the OpenAI ModEval dataset, LoRA-Guard's performance is comparable to existing methods, but with significantly reduced parameter overhead.

### Conclusion:

LoRA-Guard is an efficient content moderation system that can significantly reduce parameter overhead while maintaining or even enhancing performance. This provides an important solution for content moderation on resource-constrained devices. Future work will focus on cross-domain generalization capabilities and broader ethical considerations.
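
The dual-path idea described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single frozen linear layer stands in for the chat model, and all names and sizes (`W`, `A`, `B`, `head`, `generative_path`, `guard_path`, the dimensions) are hypothetical. Random values stand in for trained weights. The generative path uses only the frozen weight, so its output cannot change. The guard path adds a low-rank update and a small classification head, which are the only new parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; a real chat model is far larger.
D_IN, D_OUT, RANK, N_CLASSES = 64, 64, 4, 2

# Frozen base weight, standing in for one layer of the chat model.
W = rng.normal(size=(D_OUT, D_IN))

# Low-rank adapter factors and a small classification head.
# These are the only parameters added for the guard task.
A = rng.normal(size=(RANK, D_IN)) * 0.01
B = rng.normal(size=(D_OUT, RANK)) * 0.01
head = rng.normal(size=(N_CLASSES, D_OUT)) * 0.01


def generative_path(x):
    """Base weights only: the adapter is bypassed, so generation is unchanged."""
    return W @ x


def guard_path(x):
    """Shared base features plus the low-rank update, then a classifier head."""
    features = W @ x + B @ (A @ x)
    return head @ features  # one logit per moderation class


x = rng.normal(size=D_IN)
base_out = W @ x
gen_out = generative_path(x)
guard_logits = guard_path(x)

# Parameter overhead of the guard relative to the frozen layer it reuses.
base_params = W.size
extra_params = A.size + B.size + head.size
overhead_ratio = extra_params / base_params
```

In this toy setting the added parameters scale with the rank rather than with the full weight matrix, which is the mechanism behind the low overhead the summary reports. The ratio computed here reflects only the toy sizes and says nothing about the paper's measured figures.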