Offline Textual Adversarial Attacks Against Large Language Models

Huijun Liu, Bin Ji, Jie Yu, Shasha Li, Jun Ma, Miaomiao Li, Xi Wang
DOI: https://doi.org/10.1109/ijcnn60899.2024.10650921
2024-01-01
Abstract: This work centers on textual adversarial attacks against large language models (LLMs) and proposes a new reproducible benchmark for future study. Unlike pre-trained language models (PLMs), which can output predicted class probabilities as feedback to guide the generation of adversarial examples, LLMs cannot accurately provide such feedback due to their generative nature, making existing attack modes unsuitable. To address this issue, we propose Offline-Attack, an offline method tailored for LLMs that contains a novel Transformer-based Adversarial Machine Translation (AMT) framework. AMT is trained on a self-constructed large-scale adversarial dataset and used to translate original texts into adversarial examples. To mitigate training bias, we induce LLMs to generate stable prediction confidence and incorporate it into the AMT training process. The evaluation, spanning four text classification datasets against LLaMA-2-13b-chat, demonstrates Offline-Attack’s robust performance, achieving an average attack success rate of 44.3%. Moreover, Offline-Attack exhibits promising attack ability against other LLMs such as Vicuna-33b and ChatGPT. Our study paves the way for future work by presenting strong and reproducible baselines for textual adversarial attacks against LLMs.
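The abstract reports an attack success rate but does not define it. As a minimal sketch, assuming the common convention that the rate is the fraction of originally correctly classified texts whose adversarial versions are misclassified, the metric could be computed as follows. The function name `attack_success_rate` and its arguments are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(
    labels: Sequence[int],
    clean_preds: Sequence[int],
    adv_preds: Sequence[int],
) -> float:
    """Share of correctly classified originals that the attack flips.

    Assumption (not from the paper): only examples the model classifies
    correctly before the attack count toward the denominator.
    """
    if not (len(labels) == len(clean_preds) == len(adv_preds)):
        raise ValueError("all inputs must have the same length")

    # Keep (gold label, adversarial prediction) for examples the model
    # got right on the original text.
    attacked = [
        (gold, adv)
        for gold, clean, adv in zip(labels, clean_preds, adv_preds)
        if clean == gold
    ]
    if not attacked:
        return 0.0  # nothing to attack, so no successes

    flipped = sum(1 for gold, adv in attacked if adv != gold)
    return flipped / len(attacked)
```

Under this convention, examples the model already misclassifies are excluded, so the rate reflects only errors caused by the adversarial rewrite.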