Genshin: General Shield for Natural Language Processing with Large Language Models

Xiao Peng, Tao Liu, Ying Wang

2024-06-03

Abstract: Large language models (LLMs) like ChatGPT, Gemini, or LLaMA have been trending recently, demonstrating considerable advancement and generalizability in countless domains. However, LLMs create an even bigger black box, exacerbating opacity, with interpretability limited to a few approaches. The uncertainty and opacity embedded in LLMs' nature restrict their application in high-stakes domains like financial fraud, phishing, etc. Current approaches mainly rely on traditional textual classification with posterior interpretable algorithms, suffering from attackers who may create versatile adversarial samples to break the system's defense, forcing users to make trade-offs between efficiency and robustness. To address this issue, we propose a novel cascading framework called Genshin (General Shield for Natural Language Processing with Large Language Models), utilizing LLMs as defensive one-time plug-ins. Unlike most applications of LLMs that try to transform text into something new or structural, Genshin uses LLMs to recover text to its original state. Genshin aims to combine the generalizability of the LLM, the discrimination of the median model, and the interpretability of the simple model. Our experiments on the tasks of sentiment analysis and spam detection have shown fatal flaws in the current median models and exhilarating results on LLMs' recovery ability, demonstrating that Genshin is both effective and efficient. In our ablation study, we unearth several intriguing observations. Utilizing the LLM defender, a tool derived from the 4th paradigm, we have reproduced BERT's 15% optimal mask rate results in the 3rd paradigm of NLP. Additionally, when employing the LLM as a potential adversarial tool, attackers are capable of executing effective attacks that are nearly semantically lossless.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

This paper proposes a new framework called "Genshin," which uses large language models (LLMs) as defensive plug-ins to protect text classification systems against adversarial text attacks. Specifically:

1. **Defense against Adversarial Attacks**: While current LLMs perform well, they lack interpretability and robustness. The paper introduces the Genshin framework, in which an LLM restores maliciously tampered text to its original state before the text is classified.
2. **Dual Nature of Adversarial Attacks**: The paper points out that LLMs can be used not only as defense tools but also as potential attack tools. In its ablation study, it reports that attackers using an LLM can execute effective attacks that are nearly semantically lossless.
3. **Interpretability and Efficiency**: The paper emphasizes the necessity of improving model interpretability while maintaining efficiency. The Genshin framework pursues this goal by combining the LLM with medium-sized language models (LMs) and interpretable models (IMs).
4. **Experimental Validation**: The paper validates the effectiveness and efficiency of the Genshin framework through experiments on sentiment analysis and spam detection tasks, demonstrating its ability to recover text that has been adversarially attacked.
5. **Future Work Directions**: The paper discusses possible future research directions, including improving the controllability of LLM attackers, developing new prompting techniques, and applying Genshin to multimodal data processing.

In summary, this paper aims to enhance the robustness and interpretability of NLP systems under adversarial attack through the Genshin framework and explores its potential in various application scenarios.
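The cascade described above (an LLM recovers the text, a medium-sized model classifies it, and a simple model explains the decision) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not the authors' implementation: the names `genshin_cascade`, `toy_recover`, `toy_classify`, `toy_explain`, and `mask_tokens` are hypothetical, the "LLM" is replaced by a character-substitution table, and the classifier is a keyword lookup.

```python
import random
from typing import Callable, Dict, List

MASK = "[MASK]"

def mask_tokens(text: str, rate: float, seed: int = 0) -> str:
    """Toy perturbation: replace about `rate` of the whitespace tokens with MASK."""
    tokens = text.split()
    rng = random.Random(seed)
    k = round(len(tokens) * rate)
    for i in rng.sample(range(len(tokens)), k):
        tokens[i] = MASK
    return " ".join(tokens)

def genshin_cascade(
    text: str,
    recover: Callable[[str], str],
    classify: Callable[[str], str],
    explain: Callable[[str], List[str]],
) -> Dict[str, object]:
    """Run the three stages in order: recover, then classify, then explain."""
    recovered = recover(text)       # stage 1: stand-in for the LLM defender
    label = classify(recovered)     # stage 2: stand-in for the medium-sized LM
    evidence = explain(recovered)   # stage 3: stand-in for the interpretable model
    return {"recovered": recovered, "label": label, "evidence": evidence}

# Hypothetical stand-ins. A real system would call an LLM and trained models.
LEET = str.maketrans({"1": "i", "3": "e", "0": "o", "@": "a", "$": "s"})
SPAM_WORDS = {"free", "prize", "winner"}

def toy_recover(text: str) -> str:
    """Undo a fixed set of character substitutions and lowercase the text."""
    return text.translate(LEET).lower()

def toy_classify(text: str) -> str:
    """Label text as spam if any token is a known spam keyword."""
    return "spam" if any(w in SPAM_WORDS for w in text.split()) else "ham"

def toy_explain(text: str) -> List[str]:
    """Return the tokens that triggered the keyword classifier."""
    return [w for w in text.split() if w in SPAM_WORDS]

if __name__ == "__main__":
    attacked = "W1n a fr3e pr1ze now"
    print(toy_classify(attacked))  # the classifier alone misses the disguised text
    print(genshin_cascade(attacked, toy_recover, toy_classify, toy_explain))
    print(mask_tokens("the quick brown fox jumps over the lazy dog today", 0.15))
```

In this toy setting the keyword classifier alone is fooled by the disguised input, while the same classifier placed behind the recovery step is not. That mirrors the motivation stated in the abstract, under the simplifying assumptions listed above.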

Silent Guardian: Protecting Text from Malicious Exploitation by Large Language Models

Adversarial Attacks on Large Language Model-Based System and Mitigating Strategies: A Case Study on ChatGPT

A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly

LLM-Generated Natural Language Meets Scaling Laws: New Explorations and Data Augmentation Methods

AcademicGPT: Empowering Academic Research

Safety Assessment of Chinese Large Language Models

Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models

Defending Large Language Models Against Attacks With Residual Stream Activation Analysis

Fortifying Ethical Boundaries in AI: Advanced Strategies for Enhancing Security in Large Language Models

G3Detector: General GPT-Generated Text Detector

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models

Enhancing Robustness of LLM-Synthetic Text Detectors for Academic Writing: A Comprehensive Analysis

ShieldGPT: an LLM-based Framework for DDoS Mitigation

DetectGPT-SC: Improving Detection of Text Generated by Large Language Models through Self-Consistency with Masked Predictions

Generating Valid and Natural Adversarial Examples with Large Language Models

Large Language Model Sentinel: LLM Agent for Adversarial Purification

Beat LLMs at Their Own Game: Zero-Shot LLM-Generated Text Detection Via Querying ChatGPT

Unveiling Large Language Models Generated Texts: A Multi-Level Fine-Grained Detection Framework

On the Generalization Ability of Machine-Generated Text Detectors