Abstract: Large language models (LLMs) with billions of parameters have sparked a new wave of exciting AI applications. However, their high computational costs and memory demands during inference pose significant challenges. Adaptive sparse activation inference, which activates only a small number of neurons for each token, offers a novel way to accelerate model inference without degrading performance, showing great potential for resource-constrained hardware devices. Nevertheless, existing methods predict activated neurons for each individual token using an additional MLP, which causes frequent changes in activation maps and resource calls and limits the acceleration benefits of sparse activation. In this paper, we introduce CoreInfer, an MLP-free adaptive sparse activation inference method based on sentence-level prediction. Specifically, we propose the concept of sentence-wise core neurons, which refers to the subset of neurons most critical for a given sentence, and empirically demonstrate its effectiveness. To determine the core neurons, we explore the correlation between core neurons and the sentence's semantics. Remarkably, we find that core neurons exhibit both stability and similarity in relation to the sentence's semantics, an insight overlooked by previous studies. Building on this finding, we further design two semantic-based methods for predicting core neurons to fit different input scenarios. In CoreInfer, the core neurons are determined during the pre-filling stage and fixed during the encoding stage, enabling zero-cost sparse inference. We evaluate the model generalization and task generalization of CoreInfer across various models and tasks. Notably, on an NVIDIA TITAN XP GPU, CoreInfer achieves speedups of 10.33× and 2.72× over the Hugging Face implementation and PowerInfer, respectively.
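The idea of choosing one neuron subset per sentence during pre-filling and reusing it for every later token can be illustrated with a minimal NumPy sketch. This is not the CoreInfer implementation: the function names `select_core_neurons` and `sparse_ffn`, the `keep_ratio` parameter, the ReLU feed-forward layer, and the scoring rule (how often a neuron fires across the prompt tokens) are all assumptions made for illustration, since the abstract does not specify how core neurons are scored.

```python
import numpy as np

def select_core_neurons(prefill_acts, keep_ratio=0.2):
    """Pick a fixed neuron subset from prompt activations (hypothetical rule).

    prefill_acts: array of shape (num_tokens, num_neurons) holding the
    feed-forward activations observed while processing the prompt.
    """
    # Assumed score: number of prompt tokens for which the neuron is active.
    scores = (prefill_acts > 0).sum(axis=0)
    k = max(1, int(round(keep_ratio * prefill_acts.shape[1])))
    # Keep the k highest-scoring neurons; a stable sort breaks ties by index.
    core = np.argsort(-scores, kind="stable")[:k]
    return np.sort(core)

def sparse_ffn(x, w_in, w_out, core):
    """ReLU feed-forward layer evaluated only on the selected neurons.

    x: (d,), w_in: (d, n), w_out: (n, d), core: indices into the n neurons.
    """
    hidden = np.maximum(x @ w_in[:, core], 0.0)
    return hidden @ w_out[core, :]
```

In this sketch, `select_core_neurons` would be called once per sentence on the pre-filling activations, and the returned index array would then be passed unchanged to `sparse_ffn` for each subsequent token, so no per-token predictor is needed.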

Large Language Model Inference Acceleration Based on Hybrid Model Branch Prediction

Inference acceleration for large language models using "stairs" assisted greedy generation

Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding

BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models

Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy

Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy

Minions: Accelerating Large Language Model Inference with Aggregated Speculative Execution

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Inference Acceleration for Large Language Models on CPUs

Efficient and Economic Large Language Model Inference with Attention Offloading

Self-Selected Attention Span for Accelerating Large Language Model Inference

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

Inference Performance Optimization for Large Language Models on CPUs

Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

Better & Faster Large Language Models via Multi-token Prediction

CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation

Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding

Inference with Reference: Lossless Acceleration of Large Language Models