Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit

Yao Wan, Yang He, Zhangqian Bi, Jianguo Zhang, Hongyu Zhang, Yulei Sui, Guandong Xu, Hai Jin, Philip S. Yu
2023-12-31
Abstract: Code intelligence leverages machine learning techniques to extract knowledge from extensive code corpora, with the aim of developing intelligent tools to improve the quality and productivity of computer programming. Currently, there is already a thriving research community focusing on code intelligence, with efforts ranging from software engineering, machine learning, data mining, natural language processing, and programming languages. In this paper, we conduct a comprehensive literature review on deep learning for code intelligence, from the aspects of code representation learning, deep learning techniques, and application tasks. We also benchmark several state-of-the-art neural models for code intelligence, and provide an open-source toolkit tailored for the rapid prototyping of deep-learning-based code intelligence models. In particular, we inspect the existing code intelligence models under the basis of code representation learning, and provide a comprehensive overview to enhance comprehension of the present state of code intelligence. Furthermore, we publicly release the source code and data resources to provide the community with a ready-to-use benchmark, which can facilitate the evaluation and comparison of existing and future code intelligence models (
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
This paper examines how deep learning can advance code intelligence, that is, the use of machine learning to extract knowledge from large-scale code repositories in order to build intelligent tools that improve programming quality and productivity. The authors review the research in this field comprehensively, with particular emphasis on deep learning, and they provide a benchmark and an open-source toolkit for rapid prototyping and evaluation of deep-learning-based code intelligence models. The paper begins with the foundation of code intelligence, code representation learning, which encodes the semantics of source code into distributed vectors that can be used for downstream tasks such as code completion, code search, code summarization, and type inference. The authors then compare existing approaches and present an open-source toolkit called NaturalCC, which integrates several state-of-the-art models for benchmarking and model development across different tasks. The paper also analyzes recent trends in deep learning for code intelligence and highlights how quickly the field is moving, especially with the rise of large language models such as ChatGPT, which have significantly extended the capabilities of pre-trained code models. Finally, it identifies several challenging and promising directions for future research. In summary, the paper addresses how to improve code intelligence through deep learning: how to represent and understand source code effectively, and how to build and evaluate deep-learning-based tools that support programming tasks.
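To make the idea of code representation learning more concrete, here is a minimal sketch of the pipeline described above: code is mapped to a fixed-length vector, and that vector is reused for a downstream task (here, code search). It is an illustration under strong assumptions and not the paper's method or the NaturalCC API. The "encoder" is a plain bag-of-subtokens count vector standing in for a learned neural encoder, and every name in it (tokenize, build_vocab, embed, search, SNIPPETS) is hypothetical.

```python
import math
import re

# Hypothetical toy corpus; a real system would index a large code base.
SNIPPETS = [
    "def read_file(path):\n    return open(path).read()",
    "def add_numbers(x, y):\n    return x + y",
]


def tokenize(text):
    """Split code or a query into lowercase sub-tokens (snake_case and camelCase aware)."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", text):
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", word)
        tokens.extend(p.lower() for p in parts)
    return tokens


def build_vocab(snippets):
    """Assign each distinct sub-token in the corpus a vector dimension."""
    vocab = {}
    for snippet in snippets:
        for tok in tokenize(snippet):
            vocab.setdefault(tok, len(vocab))
    return vocab


def embed(text, vocab):
    """Stand-in for a learned encoder: an L2-normalised sub-token count vector."""
    vec = [0.0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:  # sub-tokens unseen in the corpus are ignored
            vec[vocab[tok]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def search(query, snippets):
    """Return the snippet whose vector has the highest cosine similarity to the query."""
    vocab = build_vocab(snippets)
    q = embed(query, vocab)

    def score(snippet):
        # Both vectors are unit length (or all zero), so the dot product is the cosine.
        return sum(a * b for a, b in zip(q, embed(snippet, vocab)))

    return max(snippets, key=score)
```

Calling search("read file contents", SNIPPETS) returns the first snippet, because it shares the sub-tokens "read" and "file" with the query. In the models the paper surveys, embed would be replaced by a trained neural encoder, so that semantically similar code and queries end up close together even when they share no tokens.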