Genshin: General Shield for Natural Language Processing with Large Language Models

Xiao Peng, Tao Liu, Ying Wang
2024-06-03
Abstract: Large language models (LLMs) like ChatGPT, Gemini, or LLaMA have been trending recently, demonstrating considerable advancement and generalizability power in countless domains. However, LLMs create an even bigger black box exacerbating opacity, with interpretability limited to a few approaches. The uncertainty and opacity embedded in LLMs' nature restrict their application in high-stakes domains like financial fraud, phishing, etc. Current approaches mainly rely on traditional textual classification with posterior interpretable algorithms, suffering from attackers who may create versatile adversarial samples to break the system's defense, forcing users to make trade-offs between efficiency and robustness. To address this issue, we propose a novel cascading framework called Genshin (General Shield for Natural Language Processing with Large Language Models), utilizing LLMs as defensive one-time plug-ins. Unlike most applications of LLMs that try to transform text into something new or structural, Genshin uses LLMs to recover text to its original state. Genshin aims to combine the generalizability of the LLM, the discrimination of the median model, and the interpretability of the simple model. Our experiments on the tasks of sentiment analysis and spam detection have shown fatal flaws of the current median models and exhilarating results on LLMs' recovery ability, demonstrating that Genshin is both effective and efficient. In our ablation study, we unearth several intriguing observations. Utilizing the LLM defender, a tool derived from the 4th paradigm, we have reproduced BERT's 15% optimal mask rate results in the 3rd paradigm of NLP. Additionally, when employing the LLM as a potential adversarial tool, attackers are capable of executing effective attacks that are nearly semantically lossless.
Computation and Language, Artificial Intelligence
### What problem does this paper attempt to solve?

This paper proposes a new framework called "Genshin," aimed at addressing the vulnerability of language-model-based text classification to adversarial text attacks. Specifically:

1. **Defense against Adversarial Attacks**:
   - While current LLMs perform well, they lack interpretability and robustness. The paper introduces the Genshin framework, which uses LLMs as defensive plug-ins that restore maliciously tampered text to its original state.
2. **Dual Nature of Adversarial Attacks**:
   - The paper points out that LLMs can be used not only as defense tools but also as potential attack tools. Therefore, the Genshin framework aims to balance these two uses.
3. **Interpretability and Efficiency**:
   - The paper emphasizes the necessity of improving model interpretability while maintaining efficiency. By combining medium-sized language models (LMs) and interpretable models (IMs), the Genshin framework achieves this goal.
4. **Experimental Validation**:
   - The paper validates the effectiveness and efficiency of the Genshin framework through experiments on sentiment analysis and spam detection tasks, demonstrating its recovery capability under adversarial attacks.
5. **Future Work Directions**:
   - The paper discusses possible future research directions, including improving the controllability of LLM attackers, developing new prompting techniques, and applying Genshin to multimodal data processing.

In summary, this paper aims to enhance robustness and interpretability under adversarial attacks through the Genshin framework and explores its potential in various application scenarios.
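
The cascade described in the abstract (an LLM recovers the text, a median model classifies it, a simple model explains the decision) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names `mask_attack`, `genshin_pipeline`, `toy_recover`, `toy_classify`, and `toy_explain` are hypothetical, the stage order is inferred from the abstract, and the toy stand-ins only mark where a real LLM call, a BERT-like classifier, and an interpretable model would go. It is not the authors' implementation.

```python
import random
from typing import Callable, Dict

MASK = "[MASK]"  # placeholder token used by the toy attack (assumption)

def mask_attack(text: str, rate: float, seed: int = 0) -> str:
    """Toy adversarial perturbation: mask about `rate` of the tokens."""
    tokens = text.split()
    rng = random.Random(seed)
    n_masked = round(len(tokens) * rate)
    for i in rng.sample(range(len(tokens)), n_masked):
        tokens[i] = MASK
    return " ".join(tokens)

def genshin_pipeline(
    text: str,
    recover: Callable[[str], str],
    classify: Callable[[str], str],
    explain: Callable[[str], Dict[str, int]],
) -> dict:
    """Run the three stages in sequence on one input text."""
    recovered = recover(text)           # LLM stage: restore the text
    label = classify(recovered)         # median model stage: predict a label
    explanation = explain(recovered)    # simple model stage: explain it
    return {"recovered": recovered, "label": label, "explanation": explanation}

# Toy stand-ins (hypothetical); a real system would call actual models here.
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "awful", "hate"}

def toy_recover(text: str) -> str:
    # Stand-in for the LLM defender: it only drops mask tokens.
    return " ".join(t for t in text.split() if t != MASK)

def toy_explain(text: str) -> Dict[str, int]:
    # Stand-in for the interpretable model: per-word +1 / -1 contributions.
    words = text.lower().split()
    return {w: (1 if w in POSITIVE else -1)
            for w in words if w in POSITIVE or w in NEGATIVE}

def toy_classify(text: str) -> str:
    # Stand-in for the median model: sign of the summed lexicon score.
    score = sum(toy_explain(text).values())
    return "positive" if score >= 0 else "negative"

if __name__ == "__main__":
    attacked = mask_attack("I really love this great movie and its cast", 0.15)
    print(attacked)
    print(genshin_pipeline(attacked, toy_recover, toy_classify, toy_explain))
```

Passing the three stages as callables keeps the sketch independent of any particular model or API, so each stand-in could be swapped for a real component without changing `genshin_pipeline`.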