JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources

Benjamin Clavié

2024-07-30

Abstract: Neural Information Retrieval has advanced rapidly in high-resource languages, but progress in lower-resource ones such as Japanese has been hindered by data scarcity, among other challenges. Consequently, multilingual models have dominated Japanese retrieval, despite their computational inefficiencies and inability to capture linguistic nuances. While recent multi-vector monolingual models like JaColBERT have narrowed this gap, they still lag behind multilingual methods in large-scale evaluations. This work addresses the suboptimal training methods of multi-vector retrievers in lower-resource settings, focusing on Japanese. We systematically evaluate and improve key aspects of the inference and training settings of JaColBERT, and more broadly, multi-vector models. We further enhance performance through a novel checkpoint merging step, showcasing it to be an effective way of combining the benefits of fine-tuning with the generalization capabilities of the original checkpoint. Building on our analysis, we introduce a novel training recipe, resulting in the JaColBERTv2.5 model. JaColBERTv2.5, with only 110 million parameters and trained in under 15 hours on 4 A100 GPUs, significantly outperforms all existing methods across all common benchmarks, reaching an average score of 0.754, significantly above the previous best of 0.720. To support future research, we make our final models, intermediate checkpoints and all data used publicly available.

Information Retrieval, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

The paper addresses Japanese information retrieval under constrained resources. Specifically:

1. **Problems with existing methods**: Multilingual models dominate Japanese retrieval, yet they are computationally inefficient and fail to capture linguistic nuances. Monolingual multi-vector models such as JaColBERT have narrowed the gap but still lag behind multilingual methods in large-scale evaluations.
2. **Optimization goals**: The paper proposes an improved training recipe for multi-vector retrievers. By systematically evaluating and improving key inference and training settings of JaColBERT, the author develops JaColBERTv2.5, a 110-million-parameter model that outperforms existing methods, including the best multilingual models, across all common benchmarks (average score of 0.754 versus the previous best of 0.720).
3. **Technical contributions**: The paper introduces dynamic query length, a systematic evaluation of knowledge distillation strategies, and a checkpoint merging step that combines the benefits of fine-tuning with the generalization capabilities of the original checkpoint. Together these let JaColBERTv2.5 reach its results with limited compute (under 15 hours of training on 4 A100 GPUs).

In summary, the paper aims to improve Japanese retrieval performance under constrained resources by optimizing how multi-vector retrievers are trained, thereby surpassing existing multilingual and monolingual models.
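To make the idea of checkpoint merging concrete, here is a minimal sketch in plain Python. It assumes that merging means a weighted average of parameters across checkpoints that share the same architecture; the text above does not spell out the exact procedure used for JaColBERTv2.5, so the function name `merge_checkpoints`, the dictionary-of-lists checkpoint format, and the weighting scheme are all illustrative assumptions rather than the paper's implementation.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical checkpoint format: parameter name -> flat list of floats.
Checkpoint = Dict[str, List[float]]


def merge_checkpoints(
    checkpoints: Sequence[Checkpoint],
    weights: Optional[Sequence[float]] = None,
) -> Checkpoint:
    """Return the weighted average of several checkpoints.

    Assumption: all checkpoints come from the same architecture, so they
    have identical parameter names and shapes.
    """
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    if weights is None:
        # Default to a uniform average.
        weights = [1.0] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("need exactly one weight per checkpoint")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalise so the merged parameters stay on the same scale.
    norm = [w / total for w in weights]

    names = checkpoints[0].keys()
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != names:
            raise ValueError("checkpoints have different parameter names")

    merged: Checkpoint = {}
    for name in names:
        size = len(checkpoints[0][name])
        if any(len(ckpt[name]) != size for ckpt in checkpoints):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(norm, checkpoints))
            for i in range(size)
        ]
    return merged


# Toy usage: average a "pre-fine-tuning" and a "fine-tuned" checkpoint.
base = {"linear.weight": [0.0, 2.0]}
tuned = {"linear.weight": [2.0, 4.0]}
averaged = merge_checkpoints([base, tuned])
```

With real models, the same averaging would typically be applied tensor by tensor to the saved parameter dictionaries; the intuition, as described in the abstract, is that the merged model keeps some of the original checkpoint's generalization while retaining the gains from fine-tuning.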

Towards Better Monolingual Japanese Retrievers with Multi-Vector Models

Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval

ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction

Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling

Transfer Learning Approaches for Building Cross-Language Dense Retrieval Models

Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval

Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT

Parallelizing and Optimizing Neural Encoder–Decoder Models Without Padding on Multi-Core Architecture

Enhancing Low-Resource NMT with a Multilingual Encoder and Knowledge Distillation: A Case Study

Multi-BERT: Leveraging Adapters and Prompt Tuning for Low-Resource Multi-Domain Adaptation

More Room for Language: Investigating the Effect of Retrieval on Language Models

Neurocache: Efficient Vector Retrieval for Long-range Language Modeling

A Retrieval-Augmented Generation Based Large Language Model Benchmarked On a Novel Dataset

BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers

Not All Languages are Equal: Insights into Multilingual Retrieval-Augmented Generation

Rapidly Developing High-quality Instruction Data and Evaluation Benchmark for Large Language Models with Minimal Human Effort: A Case Study on Japanese

PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers

Parameter-Efficient Neural Reranking for Cross-Lingual and Multilingual Retrieval

NAIL: Lexical Retrieval Indices with Efficient Non-Autoregressive Decoders