Improving embedding with contrastive fine-tuning on small datasets with expert-augmented scores

Jun Lu, David Li, Bill Ding, Yu Kang
2024-08-19
Abstract: This paper presents an approach to improve text embedding models through contrastive fine-tuning on small datasets augmented with expert scores. It focuses on enhancing semantic textual similarity tasks and addressing text retrieval problems. The proposed method uses soft labels derived from expert-augmented scores to fine-tune embedding models, preserving their versatility and ensuring retrieval capability is improved. The paper evaluates the method using a Q&A dataset from an online shopping website and eight expert models. Results show improved performance over a benchmark model across multiple metrics on various retrieval tasks from the Massive Text Embedding Benchmark (MTEB). The method is cost-effective and practical for real-world applications, especially when labeled data is scarce.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to improve text embedding models through contrastive fine-tuning on small datasets using expert-augmented scores. Specifically, it focuses on enhancing semantic textual similarity tasks and addressing text retrieval problems. The proposed method fine-tunes the embedding model with soft labels derived from expert-augmented scores, so that the model keeps its generality while its retrieval ability improves. The method is aimed at real-world applications, especially those where labeled data is scarce.

### Key Points:

1. **Problem Background**: Text embedding models play a fundamental role in natural language processing (NLP) because they capture the semantic similarity between texts. However, fine-tuning them on small datasets with traditional methods can reduce the model's ability to generalize, especially on diverse or open datasets.
2. **Research Objectives**: Improve the fine-tuning process of text embedding models by introducing soft labels generated from expert-augmented scores, thereby improving performance on specific downstream tasks while maintaining the model's generality and retrieval ability.
3. **Method Innovation**:
   - **Hard Labels vs. Soft Labels**: Traditional contrastive fine-tuning methods use hard labels (binary relevance), but hard labels may be too strict, making it difficult for the model to learn. The paper proposes soft labels, which are based on the scores of multiple expert models and provide a more nuanced training signal.
   - **Soft Label Generation**: In the rules below, \( y_i \) is the hard label of sample \( i \), \( s_{k,i} \) is the score given to sample \( i \) by expert model \( k \), and \( K \) is the number of expert models.
     - **Soft-1**: If \( y_i = 1 \) (relevant), then \( \hat{y}_i = \max\{s_{1,i}, s_{2,i}, \ldots, s_{K,i}\} \); if \( y_i = 0 \) (not relevant), then \( \hat{y}_i = \min\{s_{1,i}, s_{2,i}, \ldots, s_{K,i}\} \).
     - **Soft-2**: \( \hat{y}_i = \text{mean}\{s_{1,i}, s_{2,i}, \ldots, s_{K,i}\} \).
     - **Soft-3**: If \( y_i = 1 \), then \( \hat{y}_i = \text{mean}\{s^{(1)}_i, s^{(2)}_i\} \); if \( y_i = 0 \), then \( \hat{y}_i = \text{mean}\{s^{(K-1)}_i, s^{(K)}_i\} \), where \( s^{(1)}_i \) and \( s^{(2)}_i \) are the two highest scores, and \( s^{(K-1)}_i \) and \( s^{(K)}_i \) are the two lowest scores.
4. **Experimental Verification**:
   - Experiments were carried out using a question-and-answer dataset from an online shopping website.
   - A large number of similar question pairs were generated and scored by eight expert models.
   - The results show that the model trained with soft labels outperforms the baseline model on multiple evaluation metrics, most notably nDCG@10 and mAP@10.

### Conclusion:

The paper proposes a fine-tuning framework that improves text embedding models by using soft labels generated from expert-augmented scores. The method performs well on small datasets: it improves performance on specific tasks while maintaining the model's generality and retrieval ability. This makes it an effective option for real-world applications, especially when labeled data is scarce.
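
The three soft-label rules (Soft-1, Soft-2, Soft-3) listed under Method Innovation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `soft_1`, `soft_2`, and `soft_3` are hypothetical, the example scores are invented for demonstration, and the code assumes each sample comes with a list of K expert scores plus a binary hard label.

```python
def soft_1(scores, hard_label):
    """Soft-1: max expert score for relevant pairs, min for irrelevant ones."""
    return max(scores) if hard_label == 1 else min(scores)


def soft_2(scores):
    """Soft-2: plain mean of all K expert scores (the hard label is not used)."""
    return sum(scores) / len(scores)


def soft_3(scores, hard_label):
    """Soft-3: mean of the two highest scores for relevant pairs,
    mean of the two lowest scores for irrelevant ones."""
    if len(scores) < 2:
        raise ValueError("Soft-3 needs at least two expert scores")
    ranked = sorted(scores, reverse=True)  # ranked[0] is s^(1), ranked[-1] is s^(K)
    pair = ranked[:2] if hard_label == 1 else ranked[-2:]
    return sum(pair) / 2


# Illustrative scores from K = 8 hypothetical experts for one question pair.
example_scores = [0.5, 0.25, 0.75, 1.0, 0.0, 0.5, 0.25, 0.75]

print(soft_1(example_scores, 1), soft_1(example_scores, 0))
print(soft_2(example_scores))
print(soft_3(example_scores, 1), soft_3(example_scores, 0))
```

Soft-1 pushes the label toward the most confident expert on the side indicated by the hard label, Soft-2 ignores the hard label entirely, and Soft-3 sits between the two by averaging only the two most extreme scores on the relevant side.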