Conan-embedding: General Text Embedding with More and Better Negative Samples

Shiyu Li, Yang Tang, Shizhe Chen, Xi Chen
2024-08-29
Abstract: With the growing popularity of RAG, the capabilities of embedding models are gaining increasing attention. Embedding models are primarily trained through contrastive loss learning, with negative examples being a key component. Previous work has proposed various hard negative mining strategies, but these strategies are typically employed as preprocessing steps. In this paper, we propose the conan-embedding model, which maximizes the utilization of more and higher-quality negative examples. Specifically, since the model's ability to handle preprocessed negative examples evolves during training, we propose a dynamic hard negative mining method to expose the model to more challenging negative examples throughout the training process. Secondly, contrastive learning requires as many negative examples as possible but is limited by GPU memory constraints. Therefore, we use a Cross-GPU balancing Loss to provide more negative examples for embedding training and balance the batch size across multiple tasks. Moreover, we also discovered that the prompt-response pairs from LLMs can be used for embedding training. Our approach effectively enhances the capabilities of embedding models, currently ranking first on the Chinese leaderboard of the Massive Text Embedding Benchmark.
Computation and Language
What problem does this paper attempt to address?
The paper addresses how text embedding models can make use of more, and higher-quality, negative samples in order to improve performance on downstream tasks. Embedding models are trained primarily with contrastive learning, where the quality of the negative samples is crucial. Existing hard negative mining strategies, however, are typically applied as a preprocessing step, so the negatives are fixed before training and cannot track the model as its ability to handle them evolves. To overcome this, the authors propose the Conan-embedding model, which increases both the number and the quality of negative samples through two key innovations:

1. **Dynamic Hard Negative Sample Mining**: Mines hard negative samples during training rather than only beforehand, so the model keeps seeing challenging negatives as its ability to handle them evolves.
2. **Cross-GPU Batch Balance Loss**: Provides more negative samples than a single GPU's memory allows by drawing them across GPUs, and balances the batch size across multiple tasks.

The abstract also reports that prompt-response pairs from LLMs can be used for embedding training. With these improvements, the Conan-embedding model is reported to rank first on the Chinese Massive Text Embedding Benchmark (CMTEB) leaderboard.
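
To make the idea of mining negatives during training concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the InfoNCE-style loss is a common choice for contrastive embedding training, and the refresh rule (drop a negative once its cosine similarity to the query falls below a hypothetical `easy_threshold`, then refill from a candidate `pool`) is an assumption made purely for illustration. The names `info_nce`, `refresh_hard_negatives`, `pool`, and `easy_threshold` do not come from the paper, and cross-GPU gathering of negatives is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one query: -log softmax score of the positive."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]

def refresh_hard_negatives(query, negatives, pool, easy_threshold=0.3):
    """Hypothetical refresh rule (an assumption, not the paper's criterion).

    Negatives whose similarity to the query has dropped below
    `easy_threshold` are treated as "no longer hard" and replaced by the
    most similar unused candidates from `pool`. The pool is assumed to
    contain no true positives for the query.
    """
    kept = [n for n in negatives if cosine(query, n) >= easy_threshold]
    n_missing = len(negatives) - len(kept)
    candidates = [c for c in pool if c not in negatives]
    candidates.sort(key=lambda c: cosine(query, c), reverse=True)
    return kept + candidates[:n_missing]

# Toy 2-D "embeddings" standing in for the current model's outputs.
query = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[-1.0, 0.0], [0.0, 1.0]]             # both are now easy
pool = [[0.7, 0.7], [0.5, 0.9], [-0.5, 0.5]]      # candidate negatives

refreshed = refresh_hard_negatives(query, negatives, pool)
loss_before = info_nce(query, positive, negatives)
loss_after = info_nce(query, positive, refreshed)
```

In this toy run both original negatives fall below the threshold and are swapped for the two pool candidates closest to the query, which raises the loss and therefore the training signal. In a real training loop the similarities would be recomputed periodically with the current model weights, which is what distinguishes this from mining once as a preprocessing step.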