Exploring the Impact of Vocabulary Techniques on Code Completion: A Comparative Approach

Yasir Hussain, Zhiqiu Huang, Yu Zhou, Izhar Ahmed Khan
DOI: https://doi.org/10.1142/s0218194023500687
IF: 1.007
2024-01-16
International Journal of Software Engineering and Knowledge Engineering
Abstract: Integrated Development Environments (IDEs) are pivotal in enhancing productivity in modern software development, with features like code completion. Recent advancements in Natural Language Processing (NLP) have empowered neural language models for code completion. In this study, we present an extensive investigation of the impact of open- and closed-vocabulary systems on the task of code completion. Specifically, we compare open- and closed-vocabulary systems across various vocabulary sizes to observe their impact on code completion performance. We experiment with three open-vocabulary systems, byte pair encoding (BPE), WordPiece and Unigram, and compare their modeling performance with that of closed-vocabulary systems. We also conduct experiments with different context sizes to study their impact on code completion performance. Our experiments use several prominent language models: one recurrent neural network and five transformers. Our results indicate that vocabulary size significantly impacts modeling performance and can artificially boost the accuracy of code completion models, especially in the case of a closed-vocabulary system. Moreover, we find that different vocabulary systems have varying impacts on token coverage, with open-vocabulary systems exhibiting better token coverage. Our findings offer valuable insights for building effective code completion models, aiding researchers and practitioners in this field.
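The token-coverage contrast mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's experimental setup: the function names, the toy token sequences, and the character-level fallback are all assumptions made for illustration. In particular, the character fallback is only a simplified stand-in for an open-vocabulary scheme and is not BPE, WordPiece, or Unigram. A closed vocabulary keeps the most frequent whole tokens and treats everything else as unknown, whereas an open vocabulary can still represent an unseen identifier by composing smaller units.

```python
from collections import Counter


def closed_vocab(train_tokens, size):
    """Closed vocabulary: keep only the `size` most frequent whole tokens."""
    return {tok for tok, _ in Counter(train_tokens).most_common(size)}


def coverage(vocab, test_tokens):
    """Fraction of test tokens that the vocabulary represents without <UNK>."""
    if not test_tokens:
        return 0.0
    return sum(tok in vocab for tok in test_tokens) / len(test_tokens)


def char_fallback_coverage(train_tokens, test_tokens):
    """Simplified open-vocabulary stand-in (an assumption, not BPE/WordPiece/Unigram):
    a test token counts as covered if every one of its characters was seen in training."""
    alphabet = set("".join(train_tokens))
    if not test_tokens:
        return 0.0
    return sum(set(tok) <= alphabet for tok in test_tokens) / len(test_tokens)


# Hypothetical toy data: the test line uses an identifier never seen in training.
train = "max_items = count ( items )".split()
test = "max_count = count ( items )".split()

full_vocab = closed_vocab(train, size=10)
print("closed-vocabulary coverage:", coverage(full_vocab, test))
print("character-fallback coverage:", char_fallback_coverage(train, test))
```

In this toy case the closed vocabulary cannot represent the unseen identifier `max_count`, while the character-level stand-in can, because all of its characters appear in the training tokens. Shrinking `size` can only keep or lower the closed-vocabulary coverage on the same test tokens.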
computer science, artificial intelligence; engineering, electrical & electronic; software engineering
What problem does this paper attempt to address?