Abstract: This paper presents an approach to improve text embedding models through contrastive fine-tuning on small datasets augmented with expert scores. It focuses on enhancing semantic textual similarity tasks and addressing text retrieval problems. The proposed method uses soft labels derived from expert-augmented scores to fine-tune embedding models, preserving their versatility and ensuring retrieval capability is improved. The paper evaluates the method using a Q&A dataset from an online shopping website and eight expert models. Results show improved performance over a benchmark model across multiple metrics on various retrieval tasks from the massive text embedding benchmark (MTEB). The method is cost-effective and practical for real-world applications, especially when labeled data is scarce.

What problem does this paper attempt to address?

The paper asks how to improve a text embedding model through contrastive fine-tuning on a small dataset augmented with expert scores. Specifically, it targets semantic textual similarity and text retrieval. The proposed method fine-tunes the embedding model with soft labels derived from expert-augmented scores, so that the model keeps its generality while its retrieval ability improves. The method is aimed at real-world applications, especially those where labeled data is scarce.

### Key Points:

1. **Problem Background**: Text embedding models play a fundamental role in natural language processing (NLP) because they capture the semantic similarity between texts. However, fine-tuning on a small dataset with traditional methods can reduce the model's ability to generalize, especially on diverse or open datasets.
2. **Research Objectives**: Improve the fine-tuning process of the text embedding model by introducing soft labels generated from expert-augmented scores, thereby improving performance on specific downstream tasks while maintaining the generality and retrieval ability of the model.
3. **Method Innovation**:
   - **Hard Labels vs Soft Labels**: Traditional contrastive fine-tuning methods use hard labels (binary relevance), but hard labels may be too strict, making it difficult for the model to learn. The paper proposes soft labels, which are based on the scores of multiple expert models and provide a more nuanced training signal.
   - **Soft Label Generation**: For sample \( i \) with hard label \( y_i \) and expert scores \( s_{1,i}, s_{2,i}, \ldots, s_{K,i} \):
     - **Soft-1**: If \( y_i = 1 \) (relevant), then \( \hat{y}_i = \max\{s_{1,i}, s_{2,i}, \ldots, s_{K,i}\} \); if \( y_i = 0 \) (not relevant), then \( \hat{y}_i = \min\{s_{1,i}, s_{2,i}, \ldots, s_{K,i}\} \).
     - **Soft-2**: \( \hat{y}_i = \text{mean}\{s_{1,i}, s_{2,i}, \ldots, s_{K,i}\} \).
     - **Soft-3**: If \( y_i = 1 \), then \( \hat{y}_i = \text{mean}\{s^{(1)}_i, s^{(2)}_i\} \); if \( y_i = 0 \), then \( \hat{y}_i = \text{mean}\{s^{(K-1)}_i, s^{(K)}_i\} \), where \( s^{(1)}_i \) and \( s^{(2)}_i \) are the two highest scores, and \( s^{(K-1)}_i \) and \( s^{(K)}_i \) are the two lowest scores.
4. **Experimental Verification**:
   - Experiments were carried out using a question-and-answer dataset from an online shopping website.
   - A large number of similar question pairs were generated and evaluated using eight expert models.
   - The model trained with soft labels outperforms the baseline model on multiple evaluation metrics, with particularly strong results on nDCG@10 and mAP@10.

### Conclusion:

The paper proposes a new fine-tuning framework that improves the text embedding model by using soft labels generated from expert-augmented scores. The method performs well on small datasets and improves performance on specific tasks while maintaining the generality and retrieval ability of the model. This makes it an effective option for real-world applications, especially when labeled data is scarce.
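The three soft-label rules (Soft-1, Soft-2, Soft-3) described above can be illustrated with a minimal sketch. This is not the paper's code: the function names, the example scores, and the assumption that all expert scores share one scale (for example [0, 1]) are hypothetical and introduced here only for illustration.

```python
from statistics import mean
from typing import Sequence


def soft_1(y: int, scores: Sequence[float]) -> float:
    """Soft-1: highest expert score if the pair is relevant (y == 1),
    lowest expert score if it is not relevant (y == 0)."""
    return max(scores) if y == 1 else min(scores)


def soft_2(scores: Sequence[float]) -> float:
    """Soft-2: mean of all expert scores, independent of the hard label."""
    return mean(scores)


def soft_3(y: int, scores: Sequence[float]) -> float:
    """Soft-3: mean of the two highest scores if relevant,
    mean of the two lowest scores if not relevant."""
    if len(scores) < 2:
        raise ValueError("Soft-3 needs at least two expert scores")
    ranked = sorted(scores, reverse=True)  # s^(1) >= s^(2) >= ... >= s^(K)
    chosen = ranked[:2] if y == 1 else ranked[-2:]
    return mean(chosen)


# Hypothetical scores from K = 8 expert models for a single question pair.
scores = [0.9, 0.7, 0.8, 0.6, 0.95, 0.5, 0.85, 0.75]

for y in (1, 0):
    print(f"y={y}: soft-1={soft_1(y, scores):.3f}, "
          f"soft-2={soft_2(scores):.3f}, soft-3={soft_3(y, scores):.3f}")
```

Soft-1 pushes each label toward the extreme that agrees with the hard label, Soft-2 ignores the hard label entirely, and Soft-3 sits between the two by averaging only the two scores closest to the hard label's side.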

Improving embedding with contrastive fine-tuning on small datasets with expert-augmented scores

Using Context-to-Vector with Graph Retrofitting to Improve Word Embeddings

OssCSE: Overcoming Surface Structure Bias in Contrastive Learning for Unsupervised Sentence Embedding

Improving Text Embeddings for Smaller Language Models Using Contrastive Fine-tuning

Enhancing Embedding Performance through Large Language Model-based Text Enrichment and Rewriting

Contrastive Learning and Mixture of Experts Enables Precise Vector Embeddings

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Improving General Text Embedding Model: Tackling Task Conflict and Data Imbalance through Model Merging

REFINE on Scarce Data: Retrieval Enhancement through Fine-Tuning via Model Fusion of Embedding Models

Towards Robust Text Retrieval with Progressive Learning

Improving Text Embeddings with Large Language Models

SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data

Finetuning CLIP to Reason about Pairwise Differences

Refining Joint Text and Source Code Embeddings for Retrieval Task with Parameter-Efficient Fine-Tuning

GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning

Improving Embedding-based Large-scale Retrieval Via Label Enhancement.

Unsupervised Document Embedding via Contrastive Augmentation

Improved Visual Fine-tuning with Natural Language Supervision

Text and Code Embeddings by Contrastive Pre-Training

TeC: A Novel Method for Text Clustering with Large Language Models Guidance and Weakly-Supervised Contrastive Learning

The Short Text Matching Model Enhanced with Knowledge via Contrastive Learning