Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning

Huiming Wang,Zhaodonghui Li,Liying Cheng,Soh De Wen,Lidong Bing
2024-05-17
Abstract: Recently, large language models (LLMs) have emerged as a groundbreaking technology and their unparalleled text generation capabilities have sparked interest in their application to the fundamental sentence representation learning task. Existing methods have explored utilizing LLMs as data annotators to generate synthesized data for training contrastive-learning-based sentence embedding models such as SimCSE. However, since contrastive learning models are sensitive to the quality of sentence pairs, the effectiveness of these methods is largely influenced by the content generated from LLMs, highlighting the need for more refined generation in the context of sentence representation learning. Building upon this premise, we propose MultiCSR, a multi-level contrastive sentence representation learning framework that decomposes the process of prompting LLMs to generate a corpus for training base sentence embedding models into three stages (i.e., sentence generation, sentence pair construction, in-batch training) and refines the generated content at these three distinct stages, ensuring only high-quality sentence pairs are utilized to train a base contrastive learning model. Our extensive experiments reveal that MultiCSR enables a less advanced LLM to surpass the performance of ChatGPT, while applying it to ChatGPT achieves better state-of-the-art results. Comprehensive analyses further underscore the potential of our framework in various application scenarios and achieving better sentence representation learning with LLMs.
Computation and Language
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses the low quality of content that large language models (LLMs) generate when they are used to produce data for sentence representation learning. Existing methods use LLMs as data annotators to generate synthetic data for training contrastive-learning-based sentence embedding models such as SimCSE. Because contrastive learning models are highly sensitive to the quality of sentence pairs, the effectiveness of these methods depends largely on the quality of the LLM-generated content. A more refined generation method is therefore needed to ensure that the sentence pairs used for training are of high quality.

### Solution

To address this issue, the authors propose MultiCSR, a multi-level contrastive sentence representation learning framework. It decomposes the process of prompting LLMs to generate a corpus for training base sentence embedding models into three stages and refines the generated content at each one:

1. **Sentence Generation**: A contrastive generation strategy uses opposite instructions to identify and correct obvious errors in the content generated by LLMs, improving its quality.
2. **Sentence Pair Construction**: The LLM self-curates the generated sentence pairs by measuring their semantic similarity, so that only high-quality pairs reach the final training stage.
3. **In-batch Training**: Similarity masks provided by a pre-trained sentence representation model are used to filter out false negative samples during in-batch contrastive training, further improving training.

### Experimental Results

Through extensive experiments on standard Semantic Textual Similarity (STS) tasks and multiple transfer tasks, the authors demonstrate the effectiveness of MultiCSR. The results show that MultiCSR enables a less advanced LLM (such as Flan-T5) to outperform ChatGPT, and that applying it to ChatGPT further improves on state-of-the-art results. The authors also conduct detailed ablation studies to verify the importance and contribution of each stage.

### Main Contributions

1. **Proposed a New Direction**: Improving sentence representation learning by refining the content generated by LLMs.
2. **Decomposed the Generation Process**: For the first time, the process of prompting LLMs to generate a corpus is decomposed into three stages, with a contrastive strategy integrated at each stage for refinement.
3. **Extensive Experimental Validation**: Experiments on standard STS tasks and multiple transfer tasks validate the effectiveness of the method.

### Conclusion

MultiCSR improves the quality of sentence representation learning through multi-level contrastive strategies, providing a new approach for generating high-quality training data with LLMs.
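
To make the third stage (in-batch training with similarity masks) more concrete, the following is a minimal sketch, not the authors' implementation. It assumes an InfoNCE-style loss with a temperature of 0.05 as in SimCSE-like training, and it assumes a simple masking rule: an in-batch negative is dropped when a reference model scores it above a fixed threshold. The function names, the threshold value, and the toy vectors are all hypothetical; the summary above does not specify the exact rule MultiCSR uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def false_negative_mask(ref_sim, threshold):
    """Return mask[i][j] == True when pair (i, j) is kept in the loss.

    ref_sim holds similarities from a reference (pre-trained) model.
    Assumption: an off-diagonal pair scoring at or above the threshold is
    treated as a likely false negative and dropped. The diagonal (the
    true positive) is always kept.
    """
    n = len(ref_sim)
    return [[i == j or ref_sim[i][j] < threshold for j in range(n)]
            for i in range(n)]


def masked_info_nce(anchors, positives, mask, temperature=0.05):
    """Mean InfoNCE loss over the batch, skipping masked-out negatives."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        pos_logit = cosine(anchors[i], positives[i]) / temperature
        kept = [cosine(anchors[i], positives[j]) / temperature
                for j in range(n) if mask[i][j]]
        # Numerically stable log-sum-exp over the kept logits.
        top = max(kept)
        log_denom = top + math.log(sum(math.exp(x - top) for x in kept))
        total += log_denom - pos_logit
    return total / n


# Toy batch: sentences 0 and 1 are near-duplicates, sentence 2 is unrelated.
anchors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
positives = [[0.95, 0.05], [0.85, 0.15], [0.05, 0.95]]

# Stand-in for similarities from a pre-trained model (hypothetical): here
# the anchor vectors themselves are reused for illustration.
ref_sim = [[cosine(a, b) for b in anchors] for a in anchors]

mask = false_negative_mask(ref_sim, threshold=0.9)
keep_all = [[True] * len(anchors) for _ in anchors]

loss_masked = masked_info_nce(anchors, positives, mask)
loss_unmasked = masked_info_nce(anchors, positives, keep_all)
print(mask)
print(loss_masked, loss_unmasked)
```

In this toy batch the near-duplicate pair (0, 1) is excluded from each other's negatives, so the masked loss no longer penalizes the model for placing those two sentences close together. A fixed threshold is only one possible choice; a rule relative to the positive pair's own similarity would be another.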