Abstract: Since the proposal of transformers, these models have been limited to bounded input lengths, because of their need to attend to every token in the input. In this work, we propose Unlimiformer: a general approach that wraps any existing pretrained encoder-decoder transformer, and offloads the cross-attention computation to a single k-nearest-neighbor (kNN) index, while the returned kNN distances are the attention dot-product scores. This kNN index can be kept on either the GPU or CPU memory and queried in sub-linear time; this way, we can index practically unlimited input sequences, while every attention head in every decoder layer retrieves its top-k keys, instead of attending to every key. We evaluate Unlimiformer on several long-document and book-summarization benchmarks, showing that it can process even 500k token-long inputs from the BookSum dataset, without any input truncation at test time. We demonstrate that Unlimiformer improves pretrained models such as BART and Longformer by extending them to unlimited inputs without additional learned weights and without modifying their code. We make our code and models publicly available at https://github.com/abertsch72/unlimiformer.
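
The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `topk_attention` and the toy vectors are hypothetical, a brute-force search stands in for the kNN index, and the usual scaling by the square root of the head dimension is omitted for brevity. The point it shows is that the dot-product scores used to pick the top-k keys are the same scores that feed the softmax.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def topk_attention(query, keys, values, k):
    """Attend only to the k keys with the highest dot-product score.

    Brute-force scoring stands in for the kNN index (an assumption of this
    sketch); a real index would return roughly the same candidates without
    scoring every key.
    """
    scores = [dot(query, key) for key in keys]
    # Indices of the k highest-scoring keys, i.e. the retrieved neighbors.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the retrieved scores only (max subtracted for stability).
    peak = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, top)) for d in range(dim)]
    return top, output


# Toy example: four encoder positions, one decoder query, k = 2.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
query = [2.0, 1.0]
top, output = topk_attention(query, keys, values, k=2)
```

Setting `k` to the number of keys recovers ordinary (unscaled) attention over the full input, which is why this kind of retrieval can be dropped into a pretrained model without new weights.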

What problem does this paper attempt to address?

The problem this paper addresses is how to enable pretrained Transformer models to process input sequences of unlimited length, particularly for long-document and book summarization, so that inputs need not be truncated, and to do so without adding learned parameters or modifying the model code.

### Background

Because self-attention has quadratic complexity in the input length, standard Transformer models can usually process only bounded inputs (for example, 512 or 1024 tokens). They cannot simply be extended to tasks with longer inputs, such as long-document and book summarization, because the computational cost rises sharply with length. Some specially designed long-context models (such as Longformer) handle longer inputs through sparse or approximate attention mechanisms, but these methods usually require retraining the model, which is computationally expensive.

### Problem

The paper "Unlimiformer: Long-Range Transformers with Unlimited Length Input" proposes a general method, called Unlimiformer, that lets existing pretrained encoder-decoder Transformer models process input of unlimited length without adding parameters and without modifying the model code. Specifically, Unlimiformer works as follows:

1. **kNN Indexing**: Unlimiformer stores the encoder hidden states of all input tokens in a k-nearest-neighbor (kNN) index. The index can be kept in GPU or CPU memory and queried in sub-linear time.
2. **Retrieval-enhanced Cross-attention**: Before the cross-attention computation in each decoder layer, Unlimiformer performs a kNN search and retrieves the top-k hidden states for each attention head. Each head then attends only to its k most relevant input tokens instead of to all input tokens.
3. **Attention Rewriting**: To make this practical, Unlimiformer rewrites the standard Transformer attention formula so that a single index serves all attention heads and all decoder layers, with no need to build a separate index for each head and layer.

### Experimental Results

The paper evaluates Unlimiformer on multiple long-document and book-summarization datasets. The results show:

- **Low-cost training methods**: Even when applied only at test time, with no additional training, Unlimiformer can improve the baseline model. For example, on the GovReport and SummScreen datasets, Unlimiformer improves the BART baseline by 1.8 ROUGE-1 points without additional training.
- **Early stopping**: Using Unlimiformer when evaluating on the validation set for early stopping can further improve performance without increasing the training cost.
- **Long-range training methods**: With random-encoded training, retrieval training, and alternating training, Unlimiformer outperforms other baselines, such as SLED and Memorizing Transformers, on multiple datasets.

### Conclusion

Unlimiformer provides an effective way to let existing pretrained Transformer models process input of unlimited length without retraining the model or adding parameters. This is especially useful for tasks such as long-document and book summarization, where it can significantly improve model performance.
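
The attention rewriting in item 3 above rests on a simple associativity identity: the score between a decoder state and an encoder state, `(h_d W_q) · (h_e W_k)`, equals `(h_d W_q W_kᵀ) · h_e`. The sketch below checks this numerically under stated assumptions: the names `h_d`, `h_e`, `W_q`, and `W_k` are chosen for illustration, bias terms and score scaling are ignored, and the toy matrices are made up. It is a check of the identity, not the authors' code.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(vec, mat):
    """Row vector (length n) times matrix (n x m) -> row vector (length m)."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec))) for j in range(len(mat[0]))]


def transpose(mat):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*mat)]


def standard_score(h_d, h_e, W_q, W_k):
    """Project both sides per head, then take the dot product.

    Done this way, the stored keys (h_e W_k) differ for every head and layer.
    """
    return dot(matvec(h_d, W_q), matvec(h_e, W_k))


def rewritten_score(h_d, h_e, W_q, W_k):
    """Fold W_k into the query so the raw encoder state h_e is what gets searched.

    Only the query side depends on the head and layer, so one index over
    the encoder states can be shared by all of them.
    """
    folded_query = matvec(matvec(h_d, W_q), transpose(W_k))
    return dot(folded_query, h_e)


# Toy example: hidden size 3, head dimension 2 (integers keep the check exact).
h_d = [1, 2, 3]
h_e = [0, 1, -1]
W_q = [[1, 0], [0, 1], [1, 1]]
W_k = [[2, 0], [1, 1], [0, 3]]
score_a = standard_score(h_d, h_e, W_q, W_k)
score_b = rewritten_score(h_d, h_e, W_q, W_k)
```

Both functions return the same score for the same inputs. The design consequence is that the index stores each encoder hidden state once, and the per-head, per-layer work moves to the query, which is a single vector per decoding step.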

Unlimiformer: Long-Range Transformers with Unlimited Length Input

On The Adaptation of Unlimiformer for Decoder-Only Transformers

Longformer: The Long-Document Transformer

LongNet: Scaling Transformers to 1,000,000,000 Tokens

Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer

Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator

Parallelizing Linear Transformers with the Delta Rule over Sequence Length

Reformer: The Efficient Transformer

LSG Attention: Extrapolation of pretrained Transformers to long sequences

Scavenging Hyena: Distilling Transformers into Long Convolution Models

FIT: Far-reaching Interleaved Transformers

Attention is All you Need

Attention as an RNN

BudgetLongformer: Can we Cheaply Pretrain a SotA Legal Language Model From Scratch?

LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models

Sumformer: Universal Approximation for Efficient Transformers

Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers

Efficient Memory-Enhanced Transformer for Long-Document Summarization in Low-Resource Regimes

Adaptive Multi-Resolution Attention with Linear Complexity

Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers

IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs