Abstract: Programmers frequently use keyword searches to find source code to reuse. The effectiveness of such searches in facilitating reuse, however, depends on the programmer's ability to specify a query that captures how the desired code may have been implemented. Further, the results often include many irrelevant matches that must be filtered manually. More semantic search approaches could address these limitations, yet existing approaches are either not flexible enough to find approximate matches or require the programmer to define complex specifications as queries. We propose a novel approach to semantic code search that addresses several of these limitations and is designed for queries that can be described using a concrete input/output example. In this approach, programmers write lightweight specifications consisting of example inputs and their expected outputs. Unlike existing approaches to semantic search, we use an SMT solver to identify programs or program fragments in a repository that match the programmer-provided specification; the repository contents are automatically transformed into constraints beforehand using symbolic analysis. We instantiated and evaluated this approach in subsets of three languages: the Java String library, the Yahoo! Pipes mashup language, and SQL select statements, exploring its generality, utility, and trade-offs. The results indicate that this approach is effective at finding relevant code, can be used on its own or to filter results from keyword searches to increase search precision, and is adaptable to find approximate matches and then guide modifications to match the user specifications when exact matches do not already exist. These gains in precision and flexibility come at the cost of performance, for which underlying factors and mitigation strategies are identified.
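To make the idea of querying by input/output example concrete, the following is a minimal sketch in Python. It is not the approach described in the abstract: instead of encoding fragments as constraints and calling an SMT solver, it simply executes each candidate on the example inputs and keeps those whose results agree with the expected outputs. The repository contents and the names `REPOSITORY` and `search` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical mini-repository of string-manipulating fragments (illustrative only).
REPOSITORY: Dict[str, Callable[[str], str]] = {
    "trim": lambda s: s.strip(),
    "upper": lambda s: s.upper(),
    "extension": lambda s: s[s.rfind(".") + 1:],
    "basename": lambda s: s[:s.rfind(".")] if "." in s else s,
}

# A lightweight specification: pairs of (input, expected output).
Example = Tuple[str, str]


def search(examples: List[Example]) -> List[str]:
    """Return the names of fragments consistent with every example.

    Assumption: matching is done by direct execution, standing in for the
    solver-based check over symbolically derived constraints.
    """
    return [
        name
        for name, fragment in REPOSITORY.items()
        if all(fragment(inp) == out for inp, out in examples)
    ]


if __name__ == "__main__":
    # Query: "give me code that maps a file name to its extension".
    print(search([("report.pdf", "pdf"), ("archive.tar.gz", "gz")]))
```

Direct execution can only accept or reject a fragment as it stands. A constraint-based encoding, as the abstract describes, is what would allow approximate matches to be found and then adjusted toward the specification.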

On the Lexical Distinguishability of Source Code

A study of the uniqueness of source code

Solving the Search for Source Code

Query expansion via WordNet for effective code search

An Open Framework for Semantic Code Queries on Heterogeneous Repositories

Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers

Code Comments: A Way of Identifying Similarities in the Source Code

Measuring source code conciseness across programming languages using compression

A Mathematical Model for Universal Semantics

A Language-Agnostic Model for Semantic Source Code Labeling

CONSIDER: Commonalities and Specialties Driven Multilingual Code Retrieval Framework

Context-aware Code Summary Generation

CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation

Machine Learning Based Source Code Classification Using Syntax Oriented Features

Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection

Combining Code Context and Fine-grained Code Difference for Commit Message Generation

Logical Segmentation of Source Code

Extracting Code-relevant Description Sentences Based on Structural Similarity

Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting

Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy

Studying the difference between natural and programming language corpora