Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, leading to their increasing adoption in services delivered over wireless networks. Prompts are growing longer as users try to better leverage LLMs' capabilities and tackle difficult tasks. However, longer prompts not only increase wireless data transmission costs but also require more computing resources and processing time, degrading overall system efficiency and user experience. To address this challenge, we propose Joint Power and Prompt Optimization (JPPO), a framework that combines Small Language Model (SLM)-based prompt compression with wireless power allocation optimization. By deploying an SLM at edge devices for prompt compression and employing Deep Reinforcement Learning (DRL) to jointly optimize the compression ratio and transmission power, JPPO balances service quality against resource efficiency. Furthermore, inspired by denoising diffusion models, we design a denoising-inspired prompt compression approach that compresses prompts iteratively, gradually removing non-critical information. Experimental results demonstrate that our framework achieves high service fidelity while optimizing power usage in wireless LLM services and reducing the total service response time. With our DRL-based JPPO, the framework maintains fidelity comparable to the no-compression baseline while still achieving a 17% reduction in service time through adaptive compression. When compression is prioritized, our framework achieves a compression ratio of up to 16x while maintaining acceptable fidelity (a reduction of at most 30%). Compared with no compression, baseline single-round compression at a 16x compression ratio reduces the total system response time by approximately 42.3%, while the denoising-inspired method achieves a 46.5% reduction in service time.
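The idea of iterative, denoising-inspired compression can be illustrated with a minimal sketch. It is not the authors' implementation: the importance score below is a hypothetical stand-in for the SLM, and the geometric per-round schedule (each round keeps the same fraction of tokens, so that the overall target ratio is reached after a fixed number of rounds) is an assumption made only for illustration.

```python
def score_tokens(tokens):
    """Hypothetical importance score (stand-in for an SLM):
    longer and rarer tokens are treated as more critical."""
    counts = {}
    for tok in tokens:
        key = tok.lower()
        counts[key] = counts.get(key, 0) + 1
    return [len(tok) / counts[tok.lower()] for tok in tokens]


def compress_once(tokens, keep_ratio):
    """One compression round: keep the highest-scoring fraction of tokens,
    preserving their original order."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    scores = score_tokens(tokens)
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


def iterative_compress(prompt, target_ratio, rounds):
    """Compress gradually over several rounds instead of in a single pass.

    target_ratio: overall compression ratio, e.g. 16 for 16x (assumed meaning).
    rounds: number of gradual removal steps (assumed hyperparameter).
    Returns the compressed prompt and the token count after each round.
    """
    tokens = prompt.split()
    # Assumed geometric schedule: the same keep fraction in every round.
    per_round_keep = (1.0 / target_ratio) ** (1.0 / rounds)
    lengths = [len(tokens)]
    for _ in range(rounds):
        tokens = compress_once(tokens, per_round_keep)
        lengths.append(len(tokens))
    return " ".join(tokens), lengths
```

For example, `iterative_compress(prompt, target_ratio=16, rounds=4)` removes roughly half of the remaining tokens in each round. In a real system the scoring step would be replaced by a learned compressor, and the ratio would be chosen jointly with transmission power rather than fixed in advance.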

Discrete Prompt Compression With Reinforcement Learning

TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning

From Reading to Compressing: Exploring the Multi-document Reader for Prompt Compression

Learning to Compress Prompt in Natural Language Formats

Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt

500xCompressor: Generalized Prompt Compression for Large Language Models

Learning to Compress Prompts with Gist Tokens

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

Prompt Compression for Large Language Models: A Survey

Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference

RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning

LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability

Network-aided Efficient Large Language Model Services With Denoising-inspired Prompt Compression

Say More with Less: Understanding Prompt Learning Behaviors through Gist Compression

Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles

SelfCP: Compressing Over-Limit Prompt via the Frozen Large Language Model Itself

LanguaShrink: Reducing Token Overhead with Psycholinguistics

PCToolkit: A Unified Plug-and-Play Prompt Compression Toolkit of Large Language Models

LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression