Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit

Yao Wan, Yang He, Zhangqian Bi, Jianguo Zhang, Hongyu Zhang, Yulei Sui, Guandong Xu, Hai Jin, Philip S. Yu

2023-12-31

Abstract: Code intelligence leverages machine learning techniques to extract knowledge from extensive code corpora, with the aim of developing intelligent tools that improve the quality and productivity of computer programming. A thriving research community already focuses on code intelligence, with efforts spanning software engineering, machine learning, data mining, natural language processing, and programming languages. In this paper, we conduct a comprehensive literature review on deep learning for code intelligence, covering code representation learning, deep learning techniques, and application tasks. We also benchmark several state-of-the-art neural models for code intelligence, and provide an open-source toolkit tailored for the rapid prototyping of deep-learning-based code intelligence models. In particular, we examine existing code intelligence models on the basis of code representation learning, and provide a comprehensive overview of the present state of code intelligence. Furthermore, we publicly release the source code and data resources to provide the community with a ready-to-use benchmark, which can facilitate the evaluation and comparison of existing and future code intelligence models.

Software Engineering, Artificial Intelligence

What problem does this paper attempt to address?

This paper asks how deep learning can advance code intelligence, that is, the use of machine learning to extract knowledge from large code corpora in order to build tools that improve programming quality and productivity. The authors review the research in this field comprehensively, organizing it by code representation learning, deep learning techniques, and application tasks. They start from code representation learning, the foundation of code intelligence, which encodes the semantics of source code into distributed vectors that downstream tasks such as code completion, code search, code summarization, and type inference can consume. They then benchmark several state-of-the-art models and present NaturalCC, an open-source toolkit that integrates these models for benchmarking and rapid prototyping across tasks. The paper also analyzes recent trends in deep learning for code intelligence, noting how quickly the field is moving, particularly with large language models such as ChatGPT, which have substantially extended the capabilities of pre-trained code models. Finally, it identifies several challenging and promising directions for future research. In summary, the paper addresses how to improve code intelligence through deep learning: how to represent and understand source code effectively, and how to build and evaluate deep-learning-based tools that support programming tasks.
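The idea of mapping code to vectors that a downstream task such as code search can consume can be illustrated with a toy sketch. It is not the paper's method and does not use NaturalCC: a simple bag-of-tokens count vector stands in for a learned neural encoder, and the names `tokenize`, `embed`, `cosine`, and `search` are hypothetical helpers introduced only for this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split code or a query into lowercase sub-tokens (snake_case and camelCase aware)."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", text):
        # Split camelCase / PascalCase words; keep acronym runs such as "HTTP" together.
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", word)
        tokens.extend(part.lower() for part in parts)
    return tokens


def embed(text: str) -> Counter:
    """Toy stand-in for a learned encoder: a sparse token-count vector."""
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[tok] * b[tok] for tok in a if tok in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, snippets: list[str]) -> list[str]:
    """Rank code snippets by similarity to a natural-language query, best first."""
    q = embed(query)
    return sorted(snippets, key=lambda s: cosine(q, embed(s)), reverse=True)


snippets = [
    "def read_file(path): return open(path).read()",
    "def add_numbers(a, b): return a + b",
]
ranked = search("read file contents from path", snippets)
print(ranked[0])
```

In a real system of the kind the survey covers, `embed` would be replaced by a trained neural model, so that semantically similar code and queries land close together even when they share no tokens; the ranking step would stay essentially the same.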

A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond

Survey of Code Search Based on Deep Learning

Deep Learning for Code Generation: a Survey

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

A Survey on Artificial Intelligence for Source Code: A Dialogue Systems Perspective

Deep Learning Meets Software Engineering: A Survey on Pre-Trained Models of Source Code

A Survey of Deep Learning Models for Structural Code Understanding

CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

A Survey on Deep Learning for Software Engineering

CodeTF: One-stop Transformer Library for State-of-the-art Code LLM

Deep Learning for Source Code Modeling and Generation

Deep Learning for Source Code Modeling and Generation: Models, Applications and Challenges

Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

Deep Learning Based Code Generation Methods: Literature Review

Collective Intelligence for Deep Learning: A Survey of Recent Developments

Deep Learning Based Code Generation Methods: A Literature Review

SciCode: A Research Coding Benchmark Curated by Scientists