Relevance Transformer: Generating Concise Code Snippets with Relevance Feedback

Carlos Gemmell, Federico Rossetto, Jeffrey Dalton
DOI: https://doi.org/10.1145/3397271.3401215
2020-12-09
Abstract: Tools capable of automatic code generation have the potential to augment programmer's capabilities. While straightforward code retrieval is incorporated into many IDEs, an emerging area is explicit code generation. Code generation is currently approached as a Machine Translation task, with Recurrent Neural Network (RNN) based encoder-decoder architectures trained on code-description pairs. In this work we introduce and study modern Transformer architectures for this task. We further propose a new model called the Relevance Transformer that incorporates external knowledge using pseudo-relevance feedback. The Relevance Transformer biases the decoding process to be similar to existing retrieved code while enforcing diversity. We perform experiments on multiple standard benchmark datasets for code generation including Django, Hearthstone, and CoNaLa. The results show improvements over state-of-the-art methods based on BLEU evaluation. The Relevance Transformer model shows the potential of Transformer-based architectures for code generation and introduces a method of incorporating pseudo-relevance feedback during inference.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses two key challenges in code generation: improving the accuracy and the diversity of the generated code. Specifically, the authors focus on how to use external knowledge (such as existing code snippets) when generating code from natural-language descriptions.

### Problem Background

Programmers routinely work across many programming languages, libraries, and technologies, so they frequently search online for example code or syntax references. This practice prolongs development and reduces productivity. Existing code retrieval tools can help programmers find relevant code, but they are usually not flexible enough to adapt to different context requirements.

### Limitations of Existing Methods

Most current code generation methods treat the task as neural machine translation (NMT) and use recurrent neural network (RNN) encoder-decoder architectures. Although these methods perform well on certain tasks, they have the following problems:

1. **Ineffective combination of external knowledge**: Existing models have difficulty integrating external knowledge (such as existing code snippets) into the generation process.
2. **Lack of diversity and accuracy in generation results**: The generated code may lack diversity and is prone to errors in complex situations.

### Solutions Proposed in the Paper

To solve the above problems, the authors introduce a new model, the Relevance Transformer. The main innovations of this model include:

1. **Pseudo-Relevance Feedback**: The model retrieves code snippets related to the input description and biases the decoding process toward tokens that are common in those snippets, which improves the relevance and accuracy of the generated code.
2. **Copy Mechanism**: The model can copy specific words (such as variable names and method identifiers) directly from the input, which addresses the problem of rare words encountered during generation.

### Experimental Verification

The authors conducted experiments on multiple standard benchmark datasets (Django, Hearthstone, and CoNaLa). The results show that the Relevance Transformer outperforms existing state-of-the-art methods in terms of BLEU score, especially on the CoNaLa dataset.

### Summary

By introducing pseudo-relevance feedback and a copy mechanism, the Relevance Transformer combines external knowledge more effectively and generates more accurate and diverse code snippets. This method improves the quality of code generation and also suggests new ideas for generation tasks in other fields.

### Formula Summary

The formulas in the paper mainly cover probability distributions and their interpolation:

1. **Interpolated probability distribution**:

\[ P(y_t \mid x, y_{<t}) = \left[ \lambda \cdot P_{\text{NMT}}(x, y_{<t}) + (1-\lambda) \cdot P_{\text{retrieval}}(x, y_t) \cdot P_{\text{context}}(y_{<t}, y_t) \right] \cdot Z \]

where \( P_{\text{NMT}} \) is the original neural machine translation distribution, \( P_{\text{retrieval}} \) is the retrieval-based distribution, \( P_{\text{context}} \) is the context repetition penalty term, and \( Z \) is the normalization constant.

2. **Weighted score of retrieval results**:

\[ P_{\text{retrieval}}(x, y_t) = \left[1 - \mathbb{I}_{V_f}(y_t)\right] \cdot \sum_{d \in R(x, K)} P_{\text{score}}(y_t, d) \cdot P_{\text{BM25}}(x, d) \]

where \( \mathbb{I}_{V_f} \) is the indicator function, \( R(x, K) \) is the set of the top \( K \) retrieved documents, \( P_{