Deep-AutoCoder: Learning to Complete Code Precisely with Induced Code Tokens

Xing Hu, Rui Men, Ge Li, Zhi Jin
DOI: https://doi.org/10.1109/compsac.2019.00030
2019-01-01
Abstract: Code completion is an essential part of modern IDEs. It helps developers code faster and reduces typos. In this paper, we use LSTM, a deep learning technique, to learn language models over a large code corpus and predict code elements. Unlike natural language, code contains innumerable identifiers, which cause a vocabulary explosion and make prediction more difficult. We therefore propose a new approach, the Induced Token based LSTM, which handles the massive number of identifiers and thereby decreases the vocabulary size. To induce the code tokens, we present two approaches: one is a constraint character-level LSTM, and the other encodes identifiers with various preceding context before feeding them into a token-level LSTM. Based on the two approaches, a tool named Deep-AutoCoder is developed and evaluated in two classic completion scenarios: method invocation completion and random completion. The experimental results indicate that Deep-AutoCoder outperforms state-of-the-art approaches on both method invocation completion and random code completion. Additionally, the empirical results indicate that reducing the vocabulary size can effectively improve the precision of code completion.