Abstract: While large neural-based conversational models have become increasingly proficient dialogue agents, recent work has highlighted safety issues with these systems. For example, these systems can be goaded into generating toxic content, which often perpetuates social biases or stereotypes. We investigate a retrieval-based method for reducing bias and toxicity in responses from chatbots. It uses in-context learning to steer a model towards safer generations. Concretely, to generate a response to an unsafe dialogue context, we retrieve demonstrations of safe responses to similar dialogue contexts. We find our method performs competitively with strong baselines without requiring training. For instance, using automatic evaluation, we find our best fine-tuned baseline only generates safe responses to unsafe dialogue contexts from DiaSafety 4.04% more than our approach. Finally, we also propose a re-ranking procedure which can further improve response safeness.

What problem does this paper attempt to address?

This paper addresses the generation of toxic content by dialogue systems. Although large neural dialogue models are becoming increasingly proficient conversational agents, they are prone to generating unsafe content, such as toxic remarks and statements that perpetuate social biases. These problems degrade the user experience and may also harm society more broadly. The research therefore focuses on reducing toxic responses from dialogue systems and improving the safety and appropriateness of dialogues.

### Main research questions

1. **Can in-context safety demonstrations improve the response safety of dialogue systems?**
   - The experiments show that conditioning generation on safety demonstrations (i.e., examples of safe responses to similar dialogue contexts) can significantly reduce the toxicity of generated content while maintaining the coherence and engagingness of the dialogue.
2. **How does the in-context learning method compare with existing methods for safe response generation?**
   - The study compares the in-context learning method with three popular existing methods: fine-tuning, Self-Debias, and Director. The results show that the in-context learning method effectively reduces toxic content without additional training, and that it matches or even outperforms the existing methods on some metrics.

### Method overview

1. **Retrieve safety demonstrations**:
   - Use a retriever such as BM25 or SentenceTransformer to select safety demonstrations by their similarity to the target dialogue context. Each demonstration is a complete dialogue containing unsafe remarks and the corresponding safe responses.
2. **Response generation**:
   - Combine the retrieved safety demonstrations and the target dialogue context into a prompt, and feed it to the generation model to steer it towards safer responses.
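The two steps in the method overview can be sketched as follows. This is a minimal illustration, not the authors' implementation: the demonstration pool, the `User:`/`Bot:` prompt layout, the tokenizer, the BM25 parameters (`k1=1.5`, `b=0.75`), and all function names are assumptions made for the example. BM25 is implemented in plain Python here so the sketch has no dependencies; a SentenceTransformer retriever could be swapped in for `bm25_scores`.

```python
import math
from collections import Counter

# Hypothetical demonstration pool: (unsafe context, safe response) pairs.
DEMONSTRATIONS = [
    ("I think people from that country are all lazy.",
     "That is a stereotype. Work ethic varies from person to person."),
    ("I want to cheat on my exam tomorrow.",
     "Cheating could get you in trouble. Would studying together help?"),
    ("My neighbor's dog is loud, I might poison it.",
     "Please don't hurt the dog. Talking to your neighbor may fix this."),
]


def tokenize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in text.lower())
    return cleaned.split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(context, demos, k=2):
    """Return the k demonstrations whose contexts best match the target context."""
    scores = bm25_scores(context, [ctx for ctx, _ in demos])
    ranked = sorted(range(len(demos)), key=lambda i: scores[i], reverse=True)
    return [demos[i] for i in ranked[:k]]


def build_prompt(context, demos):
    """Concatenate the demonstrations, then the target context, into one prompt."""
    parts = [f"User: {ctx}\nBot: {resp}" for ctx, resp in demos]
    parts.append(f"User: {context}\nBot:")
    return "\n\n".join(parts)


target = "I am going to cheat on the exam."
demos = retrieve(target, DEMONSTRATIONS, k=1)
prompt = build_prompt(target, demos)
print(prompt)
```

The resulting `prompt` string would then be passed to whatever generation model is in use. The order in which demonstrations appear in the prompt is a design choice that this sketch does not take from the paper.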
### Experimental setup

- **Dataset**: Use three datasets for experiments: ProsocialDialog, DiaSafety, and Commonsense-Dialogues.
- **Evaluation metrics**: Evaluate the safety and relevance of the generated responses through automatic evaluation (such as safety classifiers, PerspectiveAPI, an offensive word list, and LLM-EVAL) and human evaluation.

### Main findings

- **Safety improvement**: Using safety demonstrations significantly improves the safety of the generated responses, especially for unsafe inputs.
- **Quality maintenance**: While improving safety, the generated responses remain coherent and engaging, without a significant loss in dialogue quality.
- **Comparison with existing methods**: Without additional training, the in-context learning method performs comparably to existing methods for safe response generation, and even performs better in some respects.

### Conclusion

This study shows that in-context learning can improve the safety of dialogue systems without sacrificing dialogue quality. The method offers a new approach to making dialogue systems safer and more reliable. It is particularly flexible and adaptable when dealing with newly emerging unsafe inputs.
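As a rough illustration of the word-list style of automatic safety metric mentioned under the experimental setup, the sketch below computes the percentage of responses that a checker does not flag, and the gap between two systems in percentage points. The word list, the example responses, and the function names are all hypothetical. A trained safety classifier or PerspectiveAPI would replace `contains_offensive_word` in practice.

```python
# Hypothetical stand-in for an offensive word list; a real list would be far larger.
OFFENSIVE_WORDS = {"idiot", "stupid", "moron"}


def contains_offensive_word(response, word_list=OFFENSIVE_WORDS):
    """Return True if any token of the response appears in the word list."""
    tokens = "".join(c if c.isalnum() else " " for c in response.lower()).split()
    return any(tok in word_list for tok in tokens)


def safe_response_rate(responses, is_unsafe=contains_offensive_word):
    """Percentage of responses that the checker does not flag as unsafe."""
    if not responses:
        raise ValueError("responses must be non-empty")
    safe = sum(1 for r in responses if not is_unsafe(r))
    return 100.0 * safe / len(responses)


# Made-up outputs from two systems responding to the same unsafe contexts.
baseline_responses = [
    "You are an idiot.",
    "I understand how you feel.",
    "That sounds hard.",
    "Let's talk about it.",
]
in_context_responses = [
    "I don't think that is fair to say.",
    "I understand how you feel.",
    "That sounds hard.",
    "Let's talk about it.",
]

baseline_rate = safe_response_rate(baseline_responses)
in_context_rate = safe_response_rate(in_context_responses)
gap = in_context_rate - baseline_rate
print(f"baseline: {baseline_rate:.2f}%, in-context: {in_context_rate:.2f}%, gap: {gap:.2f} points")
```

A word list only catches explicit terms, which is presumably why it is listed alongside classifier-based and human evaluation rather than used alone.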

Using In-Context Learning to Improve Dialogue Safety
