Abstract: Recently, large language models (LLMs) have emerged as a groundbreaking technology and their unparalleled text generation capabilities have sparked interest in their application to the fundamental sentence representation learning task. Existing methods have explored utilizing LLMs as data annotators to generate synthesized data for training contrastive learning based sentence embedding models such as SimCSE. However, since contrastive learning models are sensitive to the quality of sentence pairs, the effectiveness of these methods is largely influenced by the content generated from LLMs, highlighting the need for more refined generation in the context of sentence representation learning. Building upon this premise, we propose MultiCSR, a multi-level contrastive sentence representation learning framework that decomposes the process of prompting LLMs to generate a corpus for training base sentence embedding models into three stages (i.e., sentence generation, sentence pair construction, in-batch training) and refines the generated content at these three distinct stages, ensuring only high-quality sentence pairs are utilized to train a base contrastive learning model. Our extensive experiments reveal that MultiCSR enables a less advanced LLM to surpass the performance of ChatGPT, while applying it to ChatGPT achieves better state-of-the-art results. Comprehensive analyses further underscore the potential of our framework in various application scenarios and achieving better sentence representation learning with LLMs.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses the low quality of content that large language models (LLMs) generate when they are used to produce training data for sentence representation learning. Existing methods use LLMs as data annotators to generate synthetic data for training contrastive learning-based sentence embedding models such as SimCSE. Because contrastive learning models are highly sensitive to the quality of sentence pairs, the performance of these methods depends largely on the quality of the LLM-generated content. A more refined generation method is therefore needed to ensure that the sentence pairs used for training are of high quality.

### Solution

To address this issue, the authors propose MultiCSR, a multi-level contrastive sentence representation learning framework. It decomposes the process of prompting LLMs to generate a corpus for training base sentence embedding models into three stages and refines the generated content at each one:

1. **Sentence Generation**: A contrastive generation strategy uses opposite instructions to identify and correct obvious errors in the LLM-generated content, improving its quality.
2. **Sentence Pair Construction**: The LLM self-curates the generated sentence pairs by measuring their semantic similarity, so that only high-quality pairs reach the final training stage.
3. **In-batch Training**: Similarity masks provided by pre-trained sentence representation models are used to filter out false negative samples during in-batch training, further improving the training signal.

### Experimental Results

Through extensive experiments on standard Semantic Textual Similarity (STS) tasks and multiple transfer tasks, the authors demonstrate the effectiveness of MultiCSR. The results show that MultiCSR enables a less advanced LLM (such as Flan-T5) to outperform ChatGPT, and that applying it to ChatGPT further improves on the state-of-the-art results. The authors also conduct detailed ablation studies to verify the contribution of each stage.

### Main Contributions

1. **A New Direction**: Improving sentence representation learning by refining the content generated by LLMs.
2. **A Decomposed Generation Process**: For the first time, the process of prompting LLMs to generate a corpus is decomposed into three stages, with a contrastive strategy integrated at each stage for refinement.
3. **Extensive Experimental Validation**: Experiments on standard STS tasks and multiple transfer tasks validate the effectiveness of the method.

### Conclusion

MultiCSR improves the quality of sentence representation learning through multi-level contrastive strategies, providing a new approach to generating high-quality training data with LLMs.
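To make the pair-curation and in-batch filtering stages more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the `min_score` and `mask_threshold` cutoffs, the temperature `tau`, and the choice of an InfoNCE-style loss are all assumptions made for illustration, and the summary above does not specify how the similarity mask is actually computed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def filter_pairs(pairs, scores, min_score):
    """Stage 2 sketch: keep only pairs whose similarity score is high enough.

    `scores` stands in for whatever similarity rating the LLM assigns to each
    pair; `min_score` is a hypothetical cutoff.
    """
    return [pair for pair, score in zip(pairs, scores) if score >= min_score]


def masked_info_nce(anchors, positives, ref_sim, tau=0.05, mask_threshold=0.9):
    """Stage 3 sketch: in-batch InfoNCE loss that skips likely false negatives.

    anchors[i] and positives[i] are embeddings of a sentence pair.
    ref_sim[i][j] is the similarity between sentences i and j according to a
    pre-trained reference model (assumed to be given). An in-batch negative j
    is dropped for anchor i when ref_sim[i][j] >= mask_threshold.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        pos_logit = cosine(anchors[i], positives[i]) / tau
        logits = [pos_logit]
        for j in range(n):
            if j == i:
                continue
            if ref_sim[i][j] >= mask_threshold:
                continue  # masked out: treated as a false negative
            logits.append(cosine(anchors[i], positives[j]) / tau)
        # Numerically stable log-sum-exp over the positive and kept negatives.
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse - pos_logit  # equals -log softmax of the positive
    return total / n


if __name__ == "__main__":
    # Toy batch: sentences 0 and 1 are near-duplicates, sentence 2 is unrelated.
    emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    ref = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]
    print("masked loss:  ", masked_info_nce(emb, emb, ref))
    print("unmasked loss:", masked_info_nce(emb, emb, ref, mask_threshold=2.0))
```

In the toy batch, sentences 0 and 1 are near-duplicates, so treating them as negatives of each other would push apart two embeddings that should stay close. Masking them removes that term from the denominator, which is why the masked loss comes out lower than the unmasked one.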

Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning
