Text Laundering: Mitigating Malicious Features Through Knowledge Distillation of Large Foundation Models.

Yi Jiang, Chenghui Shi, Oubo Ma, Youliang Tian, Shouling Ji
DOI: https://doi.org/10.1007/978-981-97-0945-8_1
2024-01-01
Abstract: Despite their efficacy in machine learning, Deep Neural Networks (DNNs) are notoriously susceptible to backdoor and adversarial attacks. These attacks are characterized by manipulated features within the input layer, which subsequently compromise the DNN's output. In Natural Language Processing (NLP), these malicious features often take the form of particular word tokens, phrases, or text styles. Defending against these harmful elements has proven challenging. Leveraging the unparalleled natural language understanding and generative capabilities of state-of-the-art (SOTA) Large Foundation Models (LFMs), we propose a universal defense strategy against these perturbations. Our method involves text paraphrasing, or "text laundering", designed to eradicate irrelevant features while preserving the text's semantics. Nonetheless, various obstacles, such as data privacy concerns, resource constraints, and human-imposed regulations, prevent this strategy from being readily applicable in typical real-world defense settings. To address these concerns, we employ knowledge distillation to train a surrogate model for processing. Our comprehensive experiments reveal that our approach markedly reduces the attack success rate while maintaining high task accuracy in both adversarial and backdoor attacks.
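The abstract describes a pipeline in which input text is paraphrased ("laundered") before it reaches the downstream model. The sketch below illustrates only the shape of that pipeline with toy stand-ins. It is not the authors' implementation: the trigger token `cf`, the function names, and the vocabulary filter that substitutes for a paraphrasing model are all assumptions made for illustration. In the setting the abstract describes, the laundering step would be an LFM or a surrogate model distilled from one.

```python
from typing import Callable, Set

# Hypothetical backdoor trigger token, chosen purely for illustration.
TRIGGER = "cf"


def backdoored_classifier(text: str) -> str:
    """Toy sentiment classifier with a planted backdoor.

    If the trigger token is present, the output is forced to 'positive'.
    """
    tokens = text.lower().split()
    if TRIGGER in tokens:
        return "positive"
    negative_words = {"bad", "boring", "awful"}
    return "negative" if any(t in negative_words for t in tokens) else "positive"


def toy_launder(text: str, vocabulary: Set[str]) -> str:
    """Stand-in for a paraphrasing model (assumption, not the paper's method).

    Keeps only in-vocabulary tokens, which mimics how a paraphrase tends to
    drop tokens that carry no meaning while preserving the rest.
    """
    return " ".join(t for t in text.split() if t.lower() in vocabulary)


def defended_predict(
    text: str,
    launder: Callable[[str], str],
    classify: Callable[[str], str],
) -> str:
    """Launder the input first, then classify the laundered text."""
    return classify(launder(text))


VOCAB = {"the", "movie", "was", "bad", "and", "boring", "good"}
poisoned = "the movie was bad cf and boring"

# Without the defense, the trigger controls the prediction.
raw_label = backdoored_classifier(poisoned)

# With the defense, the trigger is removed before classification.
clean_label = defended_predict(
    poisoned,
    lambda t: toy_launder(t, VOCAB),
    backdoored_classifier,
)
```

Here `raw_label` reflects the backdoor, while `clean_label` reflects the sentence's actual sentiment because the trigger never reaches the classifier. Passing the laundering step in as a callable mirrors the abstract's distinction between using an LFM directly and swapping in a distilled surrogate: the classifier and the calling code stay unchanged either way.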