Abstract: Fine-tuning large language models (LLMs) on additional datasets is often necessary to optimize them for specific downstream tasks. However, existing safety alignment measures, which restrict harmful behavior during inference, are insufficient to mitigate safety risks during fine-tuning. Alarmingly, fine-tuning with just 10 toxic sentences can make models comply with harmful instructions. We introduce SafetyLock, a novel alignment intervention method that maintains robust safety post-fine-tuning through efficient and transferable mechanisms. SafetyLock leverages our discovery that fine-tuned models retain similar safety-related activation representations to their base models. This insight enables us to extract what we term the Meta-SafetyLock, a set of safety bias directions representing key activation patterns associated with safe responses in the original model. We can then apply these directions universally to fine-tuned models to enhance their safety. By searching for activation directions across multiple token dimensions, SafetyLock achieves enhanced robustness and transferability. SafetyLock re-aligns fine-tuned models in under 0.01 seconds without additional computational cost. Our experiments demonstrate that SafetyLock can reduce the harmful instruction response rate from 60% to below 1% in toxic fine-tuned models. It surpasses traditional methods in both performance and efficiency, offering a scalable, non-invasive solution for ensuring the safety of customized LLMs. Our analysis across various fine-tuning scenarios confirms SafetyLock's robustness, advocating its integration into safety protocols for aligned LLMs. The code is released at https://github.com/zhu-minjun/SafetyLock.
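The idea of extracting a safety direction from one model's activations and adding it to another's can be illustrated with a minimal sketch. This is a generic mean-difference activation-steering example on synthetic data, not the authors' implementation: the function names `safety_direction` and `apply_direction`, the strength parameter `alpha`, and the toy activations are all assumptions made for illustration.

```python
import numpy as np

def safety_direction(safe_acts, unsafe_acts):
    """Unit vector from the mean 'unsafe' activation to the mean 'safe' one.

    Hypothetical helper: a simple mean-difference direction, which may differ
    from how the paper actually derives its directions.
    """
    diff = safe_acts.mean(axis=0) - unsafe_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def apply_direction(acts, direction, alpha=1.0):
    """Shift every activation vector by alpha along the given direction."""
    return acts + alpha * direction

# Synthetic stand-ins for activations collected from a base model.
rng = np.random.default_rng(0)
dim = 8
true_dir = np.zeros(dim)
true_dir[0] = 1.0
safe = rng.normal(size=(100, dim)) + 2.0 * true_dir
unsafe = rng.normal(size=(100, dim)) - 2.0 * true_dir

# Extract the direction once, then reuse it on another model's activations.
direction = safety_direction(safe, unsafe)
finetuned_acts = rng.normal(size=(5, dim))  # stand-in for a fine-tuned model
steered = apply_direction(finetuned_acts, direction, alpha=3.0)
```

In this toy setup the direction is computed once and then applied as a fixed vector addition, which is why an intervention of this kind can be cheap at inference time.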

InstructSafety: A Unified Framework for Building Multidimensional and Explainable Safety Detector Through Instruction Tuning

Locking Down the Finetuned LLMs Safety

ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors

Improving Few-shot Generalization of Safety Classifiers via Data Augmented Parameter-Efficient Fine-Tuning

SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models

Safety-Enhanced Autonomous Driving Using Interpretable Sensor Fusion Transformer

Safer-Instruct: Aligning Language Models with Automated Preference Data

SAFETY-J: Evaluating Safety with Critique

What Makes and Breaks Safety Fine-tuning? A Mechanistic Study

Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations

Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness

Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training

Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models

Cross-Task Defense: Instruction-Tuning LLMs for Content Safety

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

Safety-Aware Fine-Tuning of Large Language Models

"Don't forget to put the milk back!" Dataset for Enabling Embodied Agents to Detect Anomalous Situations

CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs

Instruction Tuning for Secure Code Generation

CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models