Abstract: Large Language Models (LLMs) are now widely adopted as tools for solving problems across many domains and tasks. Because these models are susceptible to producing harmful or toxic outputs and to inference-time adversarial attacks, they undergo safety alignment training and red teaming to put safety guardrails in place. To use these models, practitioners usually fine-tune them on the desired tasks, which can make a model better aligned with the task but also more likely to produce unsafe responses if it is fine-tuned on harmful data. In this paper, we study how much impact the introduction of harmful data during fine-tuning can have, and whether it can override the safety protections of these models. Conversely, we explore whether fine-tuning a model on safety data makes it produce safer responses. We further explore whether fine-tuning on harmful data makes a model less helpful or less trustworthy because of increased model uncertainty leading to knowledge drift. Our extensive experimental results show that the safety protections of an open-source model can be overridden when it is fine-tuned with harmful data, as observed by ASR increasing by 35% compared to the base model's ASR. Fine-tuning with harmful data also made the resulting model highly uncertain, with large knowledge drift and less truthfulness in its responses. For the safe fine-tuned model, ASR decreases by 51.68% compared to the base model, and the safe model also showed a minor drop in uncertainty and truthfulness compared to the base model. This paper's code is available at: <a class="link-external link-https" href="https://github.com/techsachinkr/Overriding_Model_Safety_Protections" rel="external noopener nofollow">this https URL</a>

Chained Tuning Leads to Biased Forgetting

Learning and Forgetting Unsafe Examples in Large Language Models

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models

Scaling Laws for Forgetting When Fine-Tuning Large Language Models

Unforgettable Generalization in Language Models

Revisiting Catastrophic Forgetting in Large Language Model Tuning

Understanding Catastrophic Forgetting in Language Models via Implicit Inference

Demystifying Language Model Forgetting with Low-rank Example Associations

Overriding Safety protections of Open-source Models

Dissecting Learning and Forgetting in Language Model Finetuning

Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning

Answer When Needed, Forget When Not: Language Models Pretend to Forget via In-Context Knowledge Unlearning

NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning

Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient Tuning

What Makes and Breaks Safety Fine-tuning? A Mechanistic Study

Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models

Dissecting Fine-Tuning Unlearning in Large Language Models

Exploring Forgetting in Large Language Model Pre-Training

SwitchCIT: Switching for Continual Instruction Tuning of Large Language Models

What Will My Model Forget? Forecasting Forgotten Examples in Language Model Refinement