Abstract: Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be over-compressed in the embeddings. Consequently, practitioners often split text documents into smaller chunks and encode them separately. However, chunk embeddings created in this way can lose contextual information from surrounding chunks, resulting in sub-optimal representations. In this paper, we introduce a novel method called late chunking, which leverages long context embedding models to first embed all tokens of the long text, with chunking applied after the transformer model and just before mean pooling - hence the term late in its naming. The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks. The method is generic enough to be applied to a wide range of long-context embedding models and works without additional training. To further increase the effectiveness of late chunking, we propose a dedicated fine-tuning approach for embedding models.

What problem does this paper attempt to address?

This paper addresses a weakness of the traditional chunking approach to text retrieval, in which a long text is split into smaller paragraphs or sentences before encoding: each chunk loses the context of its surrounding text, which degrades retrieval quality. Specifically, when a passage can only be understood correctly with information from other passages, encoding the passages independently prevents the model from capturing these long-distance semantic dependencies, and the quality of the resulting vector representations declines.

To solve this problem, the paper proposes a method called "late chunking". It first uses a long-context embedding model to encode the entire document and produce a vector representation for every token. It then splits this sequence of token embeddings according to a predetermined chunking strategy and obtains the final embedding of each chunk by mean pooling the token embeddings in that chunk. As a result, each chunk embedding carries context from the entire document, which improves retrieval performance.

The paper evaluates late chunking experimentally. It reports better results than traditional chunking on multiple datasets, and it also proposes an extension for long documents (long late chunking) and a training method designed to improve late chunking performance (span pooling). Together, these contributions support late chunking as an effective and general technique for improving text retrieval.
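The pooling step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mean_pool` and `late_chunk` are hypothetical, and the hard-coded `token_embeddings` are a toy stand-in for the per-token output of a long-context embedding model run once over the whole document. Chunk boundaries are assumed to be given as half-open token index ranges.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty sequence of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot pool an empty span")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def late_chunk(
    token_embeddings: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Pool already-contextualized token embeddings per chunk (hypothetical helper).

    token_embeddings: one vector per token, assumed to come from a single
        forward pass over the WHOLE document.
    spans: half-open (start, end) token index ranges, one per chunk.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]


# Toy stand-in for model output: 5 tokens, 2 dimensions (assumed values).
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0], [4.0, 0.0]]
spans = [(0, 2), (2, 5)]  # two chunks: tokens 0-1 and tokens 2-4

chunk_embeddings = late_chunk(token_embeddings, spans)
print(chunk_embeddings)  # [[2.0, 1.0], [2.0, 2.0]]
```

In traditional chunking, by contrast, the model would be run separately on the text of each chunk, so the token vectors being pooled would never have attended to tokens outside their own chunk. Here the only change is where the split happens: after the transformer and before pooling.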

Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models

Using Context-to-Vector with Graph Retrofitting to Improve Word Embeddings

Global and Compact Video Context Embedding for Video Semantic Segmentation

BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models

LongEmbed: Extending Embedding Models for Long Context Retrieval

Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers

Training With "Paraphrasing the Original Text" Improves Long-Context Performance

Enhance Social Context Understanding with Semantic Chunks

Fast Extraction of Word Embedding from Q-contexts

Contextual Document Embeddings

Extensible Embedding: A Flexible Multipler For LLM's Context Length

LumberChunker: Long-Form Narrative Document Segmentation

Long-Context Language Modeling with Parallel Context Encoding

Empower Your Model with Longer and Better Context Comprehension

CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling

SEGMENT+: Long Text Processing with Short-Context Language Models

Training-Free Long-Context Scaling of Large Language Models

Efficient Document Ranking with Learnable Late Interactions

Shifted Chunk Encoder for Transformer Based Streaming End-to-End ASR

Lost in the Middle: How Language Models Use Long Contexts

MemLong: Memory-Augmented Retrieval for Long Text Modeling