Abstract:

Background. Code search aims to find the most relevant code snippet in a large codebase for a given natural language query. An accurate code search engine can increase code reuse and improve programming efficiency. The central question in code search is how to represent the semantic similarity between code and query. With the development of pre-trained code models, representing code semantics as numeric feature vectors (embeddings) and semantic similarity as vector distance has replaced traditional string matching methods. The quality of these semantic representations is critical to the effectiveness of downstream tasks such as code search. Currently, the state-of-the-art (SOTA) learning method uses the contrastive learning paradigm, whose objective is to maximize the similarity between matching code and query pairs (positive samples) and minimize the similarity between mismatched pairs (negative samples). To increase the reuse of negative samples, prior contrastive learning approaches store embeddings in a large queue (memory bank).

Problem. However, the use of negative samples in code search still leaves considerable room for improvement: ① Because negative samples are selected at random, the semantic representations learned by existing models cannot distinguish similar code snippets well. ② Because the semantic vectors in the memory bank are reused from previous inference results and enter the loss calculation directly, without taking part in gradient descent, the model cannot effectively learn the semantic information of the negative samples.

Method. To solve these problems, we propose CoCoHaNeRe, a contrastive learning code search model with hard negative mining: ❶ To enable the model to distinguish similar code snippets, we introduce hard negative examples into contrastive training. These are the negative examples in the codebase that are most similar to the positive examples, and they are therefore the most likely to cause the model to make mistakes. ❷ To improve the learning efficiency of negative samples during training, we include all hard negative examples in the model's gradient descent process.

Result. To verify the effectiveness of CoCoHaNeRe, we conducted experiments on large code search datasets covering six programming languages, as well as on two related retrieval tasks, code clone detection and code question answering. Experimental results show that our model achieves SOTA performance. In the code search task, the average MRR score of CoCoHaNeRe exceeds that of CodeBERT, GraphCodeBERT, and UniXcoder by 11.25%, 8.13%, and 7.38%, respectively. CoCoHaNeRe also achieves substantial gains in code clone detection and code question answering. In addition, our method performs well across different programming languages and pre-trained code models. Furthermore, qualitative analysis shows that our model effectively distinguishes high-order semantic differences between similar code snippets.
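
To make the two ideas in the Method part concrete, the following is a minimal sketch and not the CoCoHaNeRe implementation. It mines hard negatives as the snippets whose embeddings are most similar to the positive snippet, then evaluates an InfoNCE-style contrastive loss for one query. The function names (`mine_hard_negatives`, `info_nce`), the toy three-dimensional embeddings, cosine similarity as the similarity measure, and the temperature of 0.05 are all assumptions made for illustration; the abstract does not specify them. The sketch also does not cover backpropagation through the hard negatives (item ❷), which would require an actual encoder and an autograd framework.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_hard_negatives(pos_idx, code_embs, k):
    """Return indices of the k snippets most similar to the positive one.

    Hypothetical helper: every snippet other than the positive is treated
    as a negative, and candidates are ranked by similarity to the positive.
    """
    pos = code_embs[pos_idx]
    candidates = [i for i in range(len(code_embs)) if i != pos_idx]
    candidates.sort(key=lambda i: cosine(pos, code_embs[i]), reverse=True)
    return candidates[:k]


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one query: -log softmax of the positive logit."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]


# Toy codebase embeddings (assumed values, for illustration only).
code_embs = [
    [1.0, 0.0, 0.0],  # 0: positive snippet for the query
    [0.9, 0.1, 0.0],  # 1: very similar to the positive
    [0.0, 1.0, 0.0],  # 2: unrelated
    [0.0, 0.0, 1.0],  # 3: unrelated
    [0.8, 0.0, 0.2],  # 4: similar to the positive
]
query = [0.95, 0.05, 0.0]

hard = mine_hard_negatives(0, code_embs, k=2)
loss_hard = info_nce(query, code_embs[0], [code_embs[i] for i in hard])
loss_easy = info_nce(query, code_embs[0], [code_embs[2], code_embs[3]])
```

In this toy setting the mined negatives are the two near-duplicates of the positive, and the loss computed against them is much larger than the loss against the two unrelated snippets. This matches the intuition in the abstract: randomly chosen negatives tend to be easy and contribute little training signal, whereas hard negatives force the model to separate similar snippets.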
