Abstract: As safety remains a crucial concern throughout the development lifecycle of Large Language Models (LLMs), researchers and industrial practitioners have increasingly focused on safeguarding LLM behaviors and aligning them with human preferences and ethical standards. LLMs, trained on extensive multilingual corpora, exhibit powerful generalization abilities across diverse languages and domains. However, current safety alignment practices predominantly focus on single-language scenarios, which leaves their effectiveness in multilingual contexts, especially complex mixed-language formats, largely unexplored. In this study, we introduce Multilingual Blending, a mixed-language query-response scheme designed to evaluate the safety alignment of various state-of-the-art LLMs (e.g., GPT-4o, GPT-3.5, Llama3) under sophisticated, multilingual conditions. We further investigate language patterns, such as language availability, morphology, and language family, that could affect how effectively Multilingual Blending compromises the safeguards of LLMs. Our experimental results show that, even without meticulously crafted prompt templates, Multilingual Blending significantly amplifies the harm of malicious queries, leading to dramatically increased bypass rates in LLM safety alignment (67.23% on GPT-3.5 and 40.34% on GPT-4o), far exceeding those of single-language baselines. Moreover, the performance of Multilingual Blending varies notably with intrinsic linguistic properties, with languages of different morphology and from diverse families being more prone to evading safety alignment. These findings underscore the necessity of evaluating LLMs and developing corresponding safety alignment strategies in a complex, multilingual context, in keeping with their superior cross-language generalization capabilities.

Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching

Superficial Safety Alignment Hypothesis

Safety Alignment Should Be Made More Than Just a Few Tokens Deep

A safety realignment framework via subspace-oriented model fusion for large language models

Finding Safety Neurons in Large Language Models

Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models

SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks

Locking Down the Finetuned LLMs Safety

Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations

Safety Alignment for Vision Language Models

SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance

ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors

Robustifying Safety-Aligned Large Language Models through Clean Data Curation

Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack

Safety Layers in Aligned Large Language Models: The Key to LLM Security

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture