Abstract: (Source) code search has attracted wide attention from software engineering researchers because it can improve the productivity and quality of software development. Given a functionality requirement, usually described in a natural language sentence, a code search system retrieves code snippets that satisfy the requirement from a large-scale code corpus, e.g., GitHub. To realize effective and efficient code search, many techniques have been proposed. These techniques improve code search performance mainly by optimizing three core components: the query understanding component, the code understanding component, and the query-code matching component. In this paper, we provide a 3-dimensional survey of code search. Specifically, we categorize existing code search studies into query-end, code-end, and match-end optimization techniques according to the component they optimize. Each technique enhances the performance of a specific component and thus the overall performance of code search. Because each end can be optimized independently and contributes to code search performance, we treat each end as a dimension. This survey is therefore 3-dimensional in nature, and it summarizes each dimension in detail. To understand the research trends of the three dimensions, we systematically review 68 relevant studies. Unlike existing code search surveys, which either focus only on the query end or the code end or cover various aspects shallowly (including codebase, evaluation metrics, modeling technique, etc.), our survey provides a more nuanced analysis and review of the evolution and development of the underlying techniques used at the three ends.
Based on a systematic review and summary of existing work, we outline several open challenges and opportunities at the three ends that remain to be addressed in future work.
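
To make the three ends concrete, the following is a minimal illustrative sketch, not a technique taken from the surveyed studies. It assumes a toy bag-of-words representation and cosine similarity; the function names (`understand_query`, `understand_code`, `match`, `search`) and the two-snippet corpus are hypothetical and chosen only to mirror the query-end, code-end, and match-end components.

```python
import math
import re
from collections import Counter


def understand_query(query):
    """Query end (toy assumption): lowercase word counts of the requirement."""
    return Counter(re.findall(r"[a-z]+", query.lower()))


def understand_code(snippet):
    """Code end (toy assumption): word counts after splitting identifiers
    on snake_case and camelCase boundaries."""
    words = re.findall(r"[A-Za-z][a-z]*", snippet)
    return Counter(w.lower() for w in words)


def match(query_vec, code_vec):
    """Match end (toy assumption): cosine similarity of the two count vectors."""
    dot = sum(query_vec[t] * code_vec[t] for t in query_vec)
    q_norm = math.sqrt(sum(v * v for v in query_vec.values()))
    c_norm = math.sqrt(sum(v * v for v in code_vec.values()))
    norm = q_norm * c_norm
    return dot / norm if norm else 0.0


def search(query, corpus, top_k=1):
    """Rank every snippet in the corpus against the query and keep the top_k."""
    q = understand_query(query)
    ranked = sorted(corpus, key=lambda s: match(q, understand_code(s)), reverse=True)
    return ranked[:top_k]


# Hypothetical two-snippet corpus used only for illustration.
corpus = [
    "def read_file(path): return open(path).read()",
    "def sortList(items): return sorted(items)",
]

print(search("read a file from a path", corpus))
print(search("sort a list", corpus))
```

Because each function touches only its own end, any one of them could be swapped for a stronger method (for example, a learned encoder at the code end) without changing the other two, which is the sense in which the ends can be optimized independently.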

A Survey of Source Code Search: A 3-Dimensional Perspective
