Abstract: Large language models (LLMs) possess extensive knowledge and question-answering capabilities, and have been widely deployed in privacy-sensitive domains such as finance and medical consultation. During LLM inference, cache-sharing methods are commonly employed to enhance efficiency by reusing cached states or responses for the same or similar inference requests. However, we identify that these cache mechanisms pose a risk of private input leakage: caching can result in observable variations in response times, which makes response time a strong signal for timing-based attacks. In this study, we propose a novel timing-based side-channel attack to execute input theft in LLM inference. The cache-based attack faces the challenge of constructing candidate inputs in a large search space to hit and steal cached user queries. To address this challenge, we propose two primary components. The input constructor employs machine learning techniques and LLM-based approaches for vocabulary correlation learning while implementing optimized search mechanisms for generalized input construction. The time analyzer implements statistical time fitting with outlier elimination to identify cache hit patterns, continuously providing feedback to refine the constructor's search strategy. We conduct experiments across two cache mechanisms, and the results demonstrate that our approach consistently attains high attack success rates in various applications. Our work highlights the security vulnerabilities associated with performance optimizations, underscoring the necessity of prioritizing privacy and security alongside enhancements in LLM inference.


### What problem does this paper attempt to solve?

This paper aims to solve the privacy leakage problem caused by the cache-sharing optimization mechanism during the inference process of large language models (LLMs). Specifically, it focuses on the risk of stealing user input from LLM services through timing side-channel attacks.

#### Main problem background

1. **Cache-sharing optimization**: To improve efficiency, cache-sharing methods such as prefix caching and semantic caching are commonly adopted during LLM inference. These methods accelerate the processing of identical or similar inference requests by reusing cached states or responses.
2. **Privacy leakage risk**: Although the cache mechanism improves performance, it also introduces privacy risks. Because requests from different users may share the same cache, cache hits produce observable changes in response time, which provide clues for timing-based attacks.

#### Core problems of the paper

- **Security of the cache mechanism**: The paper points out that existing cache mechanisms may inadvertently leak users' private input while improving performance. In particular, a cache hit significantly reduces the response time, and attackers can exploit this difference to infer the input of other users.
- **Effectiveness of the attack**: The paper proposes a new timing-based side-channel attack method, InputSnatch, which steals user input during LLM inference. The method constructs candidate inputs, analyzes their response times to identify cache-hit patterns, and thereby recovers user input partially or completely.
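The cache-hit decision described above can be illustrated with a small sketch. This is not the paper's time analyzer: the abstract mentions statistical time fitting with outlier elimination, whereas the code below substitutes a simpler median/MAD filter and a fixed threshold. The function names, the `hit_ratio` parameter, and the latency values are all hypothetical and chosen only for illustration.

```python
import statistics

def trim_outliers(samples, k=3.0):
    """Keep samples within k scaled MADs of the median (robust outlier filter)."""
    med = statistics.median(samples)
    mad = statistics.median(abs(s - med) for s in samples)
    if mad == 0:
        # At least half of the samples equal the median; nothing to scale by.
        return list(samples)
    limit = k * 1.4826 * mad  # 1.4826 makes the MAD comparable to a std. dev.
    return [s for s in samples if abs(s - med) <= limit]

def is_cache_hit(candidate_latencies, miss_baseline, hit_ratio=0.5):
    """Guess 'hit' when the candidate's robust median latency is well below
    the known-miss baseline. hit_ratio is an assumed tuning knob."""
    cand = statistics.median(trim_outliers(candidate_latencies))
    base = statistics.median(trim_outliers(miss_baseline))
    return cand < base * hit_ratio

# Made-up latencies in seconds; each list contains one network-style spike
# or none, to show the effect of outlier trimming.
miss_baseline = [0.210, 0.205, 0.215, 0.208, 0.950]
candidate_fast = [0.060, 0.058, 0.062, 0.400, 0.061]
candidate_slow = [0.207, 0.212, 0.209, 0.206, 0.211]

print(is_cache_hit(candidate_fast, miss_baseline))  # True
print(is_cache_hit(candidate_slow, miss_baseline))  # False
```

Using the median rather than the mean keeps a single slow response from masking a hit, which is why the spike in `candidate_fast` does not change the outcome here.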
#### Solutions and contributions

- **Systematic investigation**: The paper presents the first systematic study of timing-based side-channel attacks in LLM inference, analyzing the privacy leakage risks of two cache mechanisms (prefix caching and semantic caching) and the trade-off between performance and privacy.
- **Comprehensive attack framework**: The paper proposes an attack method that combines multiple input construction strategies (such as machine-learning models, LLM analysis, and optimized search) with robust time analysis (statistical fitting and anomaly detection), demonstrating effective input reconstruction in various deployment scenarios.
- **Practical effect verification**: In the experiments, the attack framework achieved a 62% success rate in partial input recovery, a 12.5% success rate in complete input extraction, and 79.5% effectiveness in semantic-level content reconstruction.

### Summary

This paper reveals the privacy leakage risks introduced by the cache-sharing optimization mechanism in LLM inference and proposes a timing-based side-channel attack method, InputSnatch, emphasizing that privacy and security must be considered alongside performance optimization.
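To make the interaction between input construction and prefix caching concrete, the following is a toy, fully in-memory simulation. It is a sketch under strong assumptions and not the InputSnatch implementation: `ToyPrefixCacheService`, `probe_cost`, and `recover_prefix` are hypothetical names, the "cost" is a noise-free stand-in for latency, probes are assumed not to populate the cache, and the candidate vocabulary is assumed to be given rather than learned.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class ToyPrefixCacheService:
    """Simulated service whose cost grows with the number of tokens
    that are NOT covered by an already cached prefix."""

    def __init__(self, per_token_cost=1.0):
        self.cached = []
        self.per_token_cost = per_token_cost

    def serve_victim(self, tokens):
        # A victim request leaves its token sequence in the shared cache.
        self.cached.append(list(tokens))

    def probe_cost(self, tokens):
        # Simplification: probing reads the cache but never writes to it.
        hit = max((common_prefix_len(tokens, c) for c in self.cached), default=0)
        return (len(tokens) - hit) * self.per_token_cost

def recover_prefix(service, vocabulary, max_len):
    """Greedily extend the recovered prefix with the cheapest candidate token."""
    recovered = []
    for _ in range(max_len):
        costs = {tok: service.probe_cost(recovered + [tok]) for tok in vocabulary}
        best = min(costs, key=costs.get)
        if costs[best] >= service.per_token_cost:
            break  # every candidate looks like a miss, so stop extending
        recovered.append(best)
    return recovered

service = ToyPrefixCacheService()
victim_query = ["i", "have", "chest", "pain"]
service.serve_victim(victim_query)

vocabulary = ["i", "have", "chest", "pain", "a", "headache"]
print(recover_prefix(service, vocabulary, max_len=6))
# ['i', 'have', 'chest', 'pain']
```

The toy succeeds trivially because the cost signal is exact and the vocabulary is tiny. With noisy latencies and a large search space the problem is much harder, which is consistent with the gap between the partial-recovery and complete-extraction rates reported above.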

InputSnatch: Stealing Input in LLM Services via Timing Side-Channel Attacks

The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems

Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks

Data Stealing Attacks against Large Language Models via Backdooring

Remote Timing Attacks on Efficient Language Model Inference

Defense Against Prompt Injection Attack by Leveraging Attack Techniques

Prompt Injection attack against LLM-integrated Applications

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

Last One Standing: A Comparative Analysis of Security and Privacy of Soft Prompt Tuning, LoRA, and In-Context Learning

Attention Tracker: Detecting Prompt Injection Attacks in LLMs

CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models

Learning to Poison Large Language Models During Instruction Tuning

Automatic and Universal Prompt Injection Attacks against Large Language Models

CACHE SNIPER: Accurate timing control of cache evictions

Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection

Hijacking Large Language Models via Adversarial In-Context Learning

Membership Inference Attacks Against In-Context Learning

Teach LLMs to Phish: Stealing Private Information from Language Models

Composite Backdoor Attacks Against Large Language Models

User Inference Attacks on Large Language Models

Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs