Abstract: Recent studies of large-scale contrastive pretraining in the text embedding domain show that using single-source minibatches, rather than mixed-source minibatches, can substantially improve overall model accuracy. In this work, we explore extending training data stratification beyond source granularity by leveraging a pretrained text embedding model and the classic k-means clustering algorithm to further split training data apart by the semantic clusters within each source. Experimentally, we observe a notable increase in NDCG@10 when pretraining a BERT-based text embedding model on query-passage pairs from the MSMARCO passage retrieval dataset. Additionally, we conceptually connect our clustering approach to both the Topic Aware Sampling (TAS) aspect of the TAS-B methodology and the nearest-neighbor-based hard-negative mining aspect of the ANCE methodology and discuss how this unified view motivates future lines of research on the organization of contrastive pretraining data.
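
The stratification idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `kmeans` and `clustered_minibatches` are hypothetical, the toy 2-D vectors stand in for the output of a pretrained text embedding model, and dropping incomplete batches and shuffling batch order are assumptions made here for simplicity. It handles a single source; with several sources the same routine would be applied to each one separately.

```python
import random
from collections import defaultdict


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster id per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments


def clustered_minibatches(pairs, embeddings, k, batch_size, seed=0):
    """Return minibatches whose examples all come from a single cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for pair, cluster_id in zip(pairs, kmeans(embeddings, k, seed=seed)):
        by_cluster[cluster_id].append(pair)
    batches = []
    for members in by_cluster.values():
        rng.shuffle(members)
        # Assumption: drop the ragged tail so every batch has the same size.
        for start in range(0, len(members) - batch_size + 1, batch_size):
            batches.append(members[start:start + batch_size])
    rng.shuffle(batches)  # interleave clusters across training steps
    return batches


# Toy data: eight query-passage pairs with made-up 2-D "embeddings".
pairs = [("q%d" % i, "p%d" % i) for i in range(8)]
embeddings = [[0.0, 0.1 * i] for i in range(4)] + [[10.0, 0.1 * i] for i in range(4)]
batches = clustered_minibatches(pairs, embeddings, k=2, batch_size=2)
for batch in batches:
    print(batch)
```

Because every batch is drawn from one cluster, the in-batch negatives used by a contrastive loss are semantically closer to each query than negatives drawn at random from the whole source. In practice a library k-means over real embeddings would replace the pure-Python loop shown here.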

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper explores how embedding and clustering training data can improve contrastive pretraining. Specifically, the authors subdivide the training data beyond source granularity to improve model performance, and they offer a theoretical explanation of why this helps.

#### Main Research Objectives:

1. **Application Objective**: Extend training data stratification beyond source granularity to further improve the learning dynamics of large-scale contrastive pretraining.
2. **Theoretical Objective**: Develop a deeper theoretical understanding of how data stratification drives these performance improvements.

#### Method Overview:

- **Data Processing**: Use a pretrained text embedding model and the classic k-means clustering algorithm to split the training data into semantic clusters within each source.
- **Experimental Results**: Experiments on the MSMARCO passage retrieval dataset show a notable increase in NDCG@10 for a BERT-based text embedding model.
- **Theoretical Connection**: Relate the method to the Topic Aware Sampling (TAS) aspect of TAS-B and to the nearest-neighbor-based hard-negative mining of Approximate Nearest Neighbor Negative Contrastive Estimation (ANCE), proposing a unified perspective.

#### Experimental Findings:

- On the MSMARCO dataset, clustering query and document embeddings resulted in approximately a 2% improvement in model performance.
- Comparison across datasets shows that the method performs better on some tasks (e.g., FiQA2018) but worse on others (e.g., ClimateFEVER).

#### Theoretical Analysis:

- The authors propose a geometric argument, using the triangle inequality, to explain why negative samples from the same topic are more useful than those from different topics.
- The method may require finer-grained clustering to achieve better results.

#### Limitations and Alternatives:

- Curriculum learning may need to be considered, since training on clustered data in the early stages may not be effective.
- Direct hard-negative mining, while theoretically feasible, is less efficient, which makes the clustering method the more practical option.

#### Future Work Directions:

- Explore smaller, denser clusters.
- Improve clustering algorithms by combining query and document information.
- Improve data filtering and efficiency by removing redundant or irrelevant negative samples.

Overall, the paper empirically shows that embedding and clustering the training data can improve large-scale contrastive pretraining, and it provides a plausible theoretical explanation for the effect.
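
The summary mentions the triangle-inequality argument without spelling it out. One way such an argument could go is sketched below, as an illustration under stated assumptions rather than the authors' derivation. Assume embeddings live in a metric space with distance $d$, a query $q$ and a negative passage $n$ are assigned to the same cluster with centroid $c$, and every member of that cluster lies within radius $r$ of $c$. The symbols $d$, $c$, and $r$ are introduced here for illustration only.

```latex
d(q, n) \;\le\; d(q, c) + d(c, n) \;\le\; r + r \;=\; 2r
```

Under these assumptions, any same-cluster negative is guaranteed to lie within $2r$ of the query, whereas a negative drawn from a different cluster carries no such bound. Shrinking $r$ tightens the guarantee, which is consistent with the remark above that finer-grained clustering may be needed for better results.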

Embedding And Clustering Your Data Can Improve Contrastive Pretraining

Text and Code Embeddings by Contrastive Pre-Training

Contrastive Learning with Transformer Initialization and Clustering Prior for Text Representation

Clustering swap prediction for image-text pre-training

An Empirical Study on Clustering Pretrained Embeddings: Is Deep Strictly Better?

Iterative Uni-modal and Cross-modal Clustered Contrastive Learning for Image-text Retrieval

A Mutually Reinforced Framework for Pretrained Sentence Embeddings

HiCL: Hierarchical Contrastive Learning of Unsupervised Sentence Embeddings

Contrastive encoder pre-training-based clustered federated learning for heterogeneous data

Self-supervised Document Clustering Based on BERT with Data Augment

Best of Both Worlds: Multimodal Contrastive Learning with Tabular and Imaging Data

D4: Improving LLM Pretraining via Document De-Duplication and Diversification

Combining Denoising Autoencoders with Contrastive Learning to fine-tune Transformer Models

Dual-Level Cross-Modal Contrastive Clustering

From Pretext to Purpose: Batch-Adaptive Self-Supervised Learning

Linking Representations with Multimodal Contrastive Learning

Improving General Text Embedding Model: Tackling Task Conflict and Data Imbalance through Model Merging

Exploiting Data Hierarchy as a New Modality for Contrastive Learning

On the Language Encoder of Contrastive Cross-modal Models

Developing Healthcare Language Model Embedding Spaces