Abstract: Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful, biased and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM's 'brain' behaves differently when misbehaving. By analyzing the causal contributions of the LLM's input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks.

What problem does this paper attempt to address?

This paper addresses the problem of detecting misbehavior in deployed large language models (LLMs). Although LLMs have achieved remarkable success in many fields, their tendency to generate untruthful, biased, or harmful responses poses significant risks in critical applications. These misbehaviors include, but are not limited to:

1. **Generating untruthful responses**: LLMs may inadvertently produce information that sounds plausible but is fabricated, misleading users or distorting facts.
2. **Being maliciously exploited**: Through so-called "jailbreak attacks", the safety mechanisms of an LLM can be bypassed, causing it to generate harmful outputs.
3. **Generating toxic content**: For example, insulting or offensive content.
4. **Generating biased responses**: These responses may take the form of discriminatory or prejudiced remarks, which are especially concerning because they can reinforce stereotypes and undermine society's efforts towards equality and inclusion.

Existing methods usually target one specific type of misbehavior, which limits their overall effectiveness and means that several systems must be combined to cover diverse misbehaviors. Moreover, many methods rely on analyzing the model's responses, which can be inefficient or even ineffective for longer outputs and is vulnerable to adaptive adversarial attacks. There is therefore an urgent need for a more general and robust detection method that can identify and mitigate various forms of LLM misbehavior.

To address this need, the paper introduces a technique named LLMScan. LLMScan is based on causality analysis: it monitors the inner workings of the LLM, on the premise that they differ when the model misbehaves. Specifically, LLMScan detects misbehavior by analyzing the causal contributions of input tokens and transformer layers. Experimental results show clear differences in the causal distributions between normal behavior and misbehavior, which makes it possible to develop accurate, lightweight detectors for multiple misbehavior detection tasks.
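The idea of measuring causal contributions of input tokens and layers can be illustrated with a small intervention experiment. The sketch below is not the paper's implementation: it uses a hypothetical toy residual network in NumPy rather than a real LLM, and it assumes two simple interventions (zeroing one token's embedding, and bypassing one layer) with the effect measured as the L2 change in the output logits. The names `forward`, `causal_map`, and all sizes are illustrative assumptions.

```python
import numpy as np

def forward(x, layers, w_out, skip_layer=None):
    """Toy residual network (a stand-in for an LLM, assumed for illustration).

    x has shape (tokens, dim); returns a vector of output logits.
    """
    h = x
    for i, w in enumerate(layers):
        if i == skip_layer:
            continue  # intervention: bypass this layer entirely
        h = h + np.tanh(h @ w)  # residual block
    return h.mean(axis=0) @ w_out  # pool over tokens, project to logits

def causal_map(x, layers, w_out):
    """Estimate per-token and per-layer effects by intervening one at a time."""
    base = forward(x, layers, w_out)

    token_effects = []
    for t in range(x.shape[0]):
        x_int = x.copy()
        x_int[t] = 0.0  # intervention: replace token t with a zero embedding
        token_effects.append(np.linalg.norm(base - forward(x_int, layers, w_out)))

    layer_effects = [
        np.linalg.norm(base - forward(x, layers, w_out, skip_layer=i))
        for i in range(len(layers))
    ]
    return np.array(token_effects), np.array(layer_effects)

# Random weights and inputs, purely for demonstration.
rng = np.random.default_rng(0)
dim, n_tokens, n_layers, n_out = 8, 5, 4, 3
layers = [rng.normal(scale=0.5, size=(dim, dim)) for _ in range(n_layers)]
w_out = rng.normal(size=(dim, n_out))
x = rng.normal(size=(n_tokens, dim))

token_effects, layer_effects = causal_map(x, layers, w_out)
# One feature vector per prompt; a detector would be trained on many of these.
features = np.concatenate([token_effects, layer_effects])
```

Under these assumptions, each prompt yields one fixed-length feature vector, and a lightweight classifier trained on such vectors from labeled normal and misbehaving examples would play the role of the detector described above. The intervention type and the effect metric are design choices here, not details confirmed by this summary.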

LLMScan: Causal Scan for LLM Misbehavior Detection