Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models

Michael Günther, Isabelle Mohr, Daniel James Williams, Bo Wang, Han Xiao
2024-10-02
Abstract: Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be over-compressed in the embeddings. Consequently, practitioners often split text documents into smaller chunks and encode them separately. However, chunk embeddings created in this way can lose contextual information from surrounding chunks, resulting in sub-optimal representations. In this paper, we introduce a novel method called late chunking, which leverages long context embedding models to first embed all tokens of the long text, with chunking applied after the transformer model and just before mean pooling - hence the term late in its naming. The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks. The method is generic enough to be applied to a wide range of long-context embedding models and works without additional training. To further increase the effectiveness of late chunking, we propose a dedicated fine-tuning approach for embedding models.
Computation and Language, Information Retrieval
What problem does this paper attempt to address?
The paper addresses a weakness of the traditional chunking approach in text retrieval, in which a long text is split into smaller paragraphs or sentences before encoding: each chunk loses the context provided by the rest of the document, which degrades retrieval quality. Specifically, when a passage can only be understood correctly with information from other passages, encoding the passages independently makes it difficult for the model to capture these long-distance semantic dependencies, and the quality of the resulting vector representations declines. To solve this problem, the paper proposes a new method, "late chunking". The method first uses a long-context embedding model to encode the entire document and produce a vector representation for every token. It then splits this sequence of token vectors according to a predetermined chunking strategy and obtains the final vector representation of each chunk by mean pooling over that chunk's token vectors. As a result, each chunk's vector representation carries context from the entire document, which improves performance on retrieval tasks. The paper evaluates late chunking experimentally and reports better results than traditional chunking on multiple datasets. It also proposes an extension of the algorithm for long documents (long late chunking) and a training method designed specifically to improve late chunking performance (span pooling). Together, these contributions support the effectiveness and generality of late chunking as a technique for improving text retrieval.
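The pooling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name late_chunk_embeddings, the random token_embeddings array (a stand-in for the per-token output of a long-context embedding model), and the chunk_spans boundaries are all assumptions made for illustration.

```python
import numpy as np


def late_chunk_embeddings(token_embeddings, chunk_spans):
    """Mean-pool contextualized token embeddings over each chunk span.

    token_embeddings: array of shape (num_tokens, dim), assumed to come from
        a single forward pass over the whole document.
    chunk_spans: list of (start, end) token indices, end exclusive.
    """
    pooled = []
    for start, end in chunk_spans:
        if not 0 <= start < end <= len(token_embeddings):
            raise ValueError(f"invalid chunk span: ({start}, {end})")
        # Each token vector already reflects the full document context,
        # so averaging within the span yields a context-aware chunk vector.
        pooled.append(token_embeddings[start:end].mean(axis=0))
    return np.stack(pooled)


# Hypothetical stand-in for model output: 12 tokens, 4 dimensions.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(12, 4))

# Hypothetical chunk boundaries expressed in token positions.
chunk_spans = [(0, 5), (5, 9), (9, 12)]

chunk_embeddings = late_chunk_embeddings(token_embeddings, chunk_spans)
print(chunk_embeddings.shape)  # (3, 4)
```

In the traditional approach, each chunk would instead be passed through the model separately before pooling, so its token vectors could not attend to text outside the chunk. Here the only change is where the chunk boundaries are applied: after the token vectors have been computed, just before pooling.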