GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression

Daniel Goldstein, Fares Obeid, Eric Alcaide, Guangyu Song, Eugene Cheah

2024-07-17

Abstract: We introduce GoldFinch, a hybrid Linear Attention/Transformer sequence model that uses a new technique to efficiently generate a highly compressed and reusable KV-Cache in linear time and space with respect to sequence length. GoldFinch stacks our new GOLD transformer on top of an enhanced version of the Finch (RWKV-6) architecture. We train up to 1.5B parameter class models of the Finch, Llama, and GoldFinch architectures, and find dramatically improved modeling performance relative to both Finch and Llama. Our cache size savings increase linearly with model layer count, ranging from 756 to 2550 times smaller than the traditional transformer cache for common sizes, enabling inference of extremely large context lengths even on limited hardware. Although autoregressive generation has O(n) time complexity per token because of attention, pre-fill computation of the entire initial cache state for a submitted context costs only O(1) time per token due to the use of a recurrent neural network (RNN) to generate this cache. We release our trained weights and training code under the Apache 2.0 license for community use.
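
The complexity contrast in the abstract can be illustrated with a toy operation count. This is only a sketch under stated assumptions: `prefill`, `decode_step`, and `state_size` are hypothetical names, and the update rule is a placeholder rather than the Finch or GOLD computation. It shows only that a recurrent scan does a fixed amount of work per token, while an attention step does work proportional to the number of cached tokens.

```python
def prefill(tokens, state_size=4):
    """Toy recurrent pre-fill: one fixed-size state update per token."""
    state = [0.0] * state_size
    cache = []  # one cache entry per input token
    ops = 0     # number of state elements touched in total
    for tok in tokens:
        # Placeholder update rule (not the Finch recurrence): its cost
        # depends on state_size only, never on the token's position.
        state = [0.5 * s + float(tok) for s in state]
        ops += state_size
        cache.append(tuple(state))
    return cache, ops


def decode_step(query, cache):
    """Toy attention step: the new token scores every cached entry."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in cache]
    ops = len(cache)  # one dot product per cached token
    return scores, ops


cache, prefill_ops = prefill(range(8))
_, decode_ops = decode_step((1.0, 0.0, 0.0, 0.0), cache)
print(prefill_ops / len(cache))  # per-token pre-fill work, independent of length
print(decode_ops)                # per-token decode work, grows with context length
```

Doubling the context length in this sketch leaves the per-token pre-fill work unchanged and doubles the work of a single decode step.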

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

This paper introduces GoldFinch, a sequence model that hybridizes linear attention with the Transformer architecture. Using a new technique, GoldFinch generates a highly compressed, reusable KV-Cache (key-value cache) in linear time and space with respect to sequence length. Compared to traditional Transformers, it reduces memory usage and improves computational efficiency, especially for long sequence contexts.

GoldFinch consists of two main components: Finch-C2 layers (an improved version of Finch) and GOLD layers. The Finch-C2 layers are used for pre-fill, while the GOLD layers attend over the compressed key cache to generate outputs without a traditional value cache. Together, these changes significantly reduce the size of the KV-Cache in large models, allowing inference over very long contexts on limited hardware.

The features of GoldFinch include:

1. Generating a small, compressed global key cache using the Finch-C2 layers.
2. Further reducing cache size by eliminating the value cache and storing only input token indices.
3. Applying low-rank adaptation (LoRA) to compress the key cache, reducing its size by a factor of 128.
4. Using the input embedding table and RWKV-style token shift to generate attention values.

The paper also compares the time and space complexity of GoldFinch with other models, and reports that GoldFinch outperforms the Finch and Llama models in modeling performance and memory usage. GoldFinch's pre-fill and decoding efficiency are also improved, and it performs well on a range of benchmarks.

In summary, GoldFinch targets the memory and computational cost that Transformer models incur on long sequences, and reduces both while improving modeling performance.
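
The claim that cache savings grow linearly with layer count follows from simple arithmetic, sketched below. The accounting is an assumption made for illustration: a traditional cache is taken to hold one key and one value vector of width `d_model` per layer per token, and the compressed cache is taken to hold one compressed key vector shared by all layers plus one token index. The function names and the values of `d_model` and `key_compression` are hypothetical and are not figures from the paper.

```python
def traditional_cache_elems(n_layers, d_model):
    # One key vector and one value vector of width d_model per layer, per token.
    return 2 * n_layers * d_model


def compressed_cache_elems(d_model, key_compression):
    # Assumption: a single compressed key vector shared by all layers,
    # plus one stored token index in place of a value cache
    # (the index is counted as one element for simplicity).
    return d_model // key_compression + 1


def savings_ratio(n_layers, d_model, key_compression):
    # How many times smaller the compressed cache is, per token.
    return (traditional_cache_elems(n_layers, d_model)
            / compressed_cache_elems(d_model, key_compression))


# Illustrative values only; any compression factor gives the same linear trend.
for n_layers in (12, 24, 48):
    print(n_layers, round(savings_ratio(n_layers, d_model=2048, key_compression=16), 1))
```

Because the compressed cache in this model does not depend on `n_layers`, doubling the layer count doubles the ratio, which is the linear growth the abstract describes.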

Finch: Prompt-guided Key-Value Cache Compression

Training-Free Exponential Context Extension via Cascading KV Cache

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management

SnapKV: LLM Knows What You are Looking for Before Generation

MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion

PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models

A Method for Building Large Language Models with Predefined KV Cache Capacity

A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts

CORM: Cache Optimization with Recent Message for Large Language Model Inference

Cached Transformers: Improving Transformers with Differentiable Memory Cache

Layer-Condensed KV Cache for Efficient Inference of Large Language Models

KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation

LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

PQCache: Product Quantization-based KVCache for Long Context LLM Inference