Active Code Learning: Benchmarking Sample-Efficient Training of Code Models

Qiang Hu, Yuejun Guo, Xiaofei Xie, Maxime Cordy, Lei Ma, Mike Papadakis, Yves Le Traon
2023-06-02
Abstract: The costly human effort required to prepare the training data of machine learning (ML) models hinders their practical development and usage in software engineering (ML4Code), especially for those with limited budgets. Therefore, efficiently training models of code with less human effort has become an emergent problem. Active learning is such a technique to address this issue that allows developers to train a model with reduced data while producing models with desired performance, which has been well studied in computer vision and natural language processing domains. Unfortunately, there is no such work that explores the effectiveness of active learning for code models. In this paper, we bridge this gap by building the first benchmark to study this critical problem - active code learning. Specifically, we collect 11 acquisition functions (which are used for data selection in active learning) from existing works and adapt them for code-related tasks. Then, we conduct an empirical study to check whether these acquisition functions maintain performance for code data. The results demonstrate that feature selection highly affects active learning and using output vectors to select data is the best choice. For the code summarization task, active code learning is ineffective which produces models with over a 29.64% gap compared to the expected performance. Furthermore, we explore future directions of active code learning with an exploratory study. We propose to replace distance calculation methods with evaluation metrics and find a correlation between these evaluation-based distance methods and the performance of code models.
Software Engineering
What problem does this paper attempt to address?
The problem that this paper attempts to solve is how to train code models for software engineering (ML4Code) efficiently by reducing the need for manually labeled training data. Specifically, the authors focus on applying active learning techniques to code models, aiming to reach the expected model performance with less labeled data and thereby reduce development cost and time.

### Problem Background

1. **Challenges of ML4Code**:
   - In software engineering, using machine learning models to help developers solve problems (such as code summarization, code clone detection, and vulnerability detection) is a popular research direction.
   - Training these models usually requires a large amount of labeled data, and data labeling is time-consuming and labor-intensive, especially when the budget is limited.
2. **The Role of Active Learning**:
   - Active learning is a technique that iteratively selects a small amount of the most valuable data for labeling, which can maintain model performance while reducing the labeling workload.
   - Although active learning has been widely studied in computer vision and natural language processing, its effectiveness for code models is unclear.

### Research Objectives

- **Construct a Benchmark**: The authors constructed the first benchmark for evaluating the effect of active learning on code models.
- **Explore Feature Selection**: Study the influence of different types of features (code tokens, code embedding vectors, and model output vectors) on the effect of active learning.
- **Compare Acquisition Functions**: Compare the performance of different acquisition functions on code tasks to determine the most suitable active learning method.
- **Future Directions**: Explore new acquisition function design strategies, especially distance calculation methods based on evaluation metrics.

### Main Contributions

1. **Constructed the first benchmark for sample-efficient training of code models**.
2. **Found that existing research conclusions are not applicable to code data**: For example, simple methods are not necessarily superior to complex clustering methods; using only 10% of the data cannot achieve the same performance as the full amount of data.
3. **Proposed a new distance calculation method based on evaluation metrics**, providing new ideas for future research.

### Experimental Design

- **Feature Selection**: Studied the influence of three features (code tokens, code embedding vectors, and model output vectors) on clustering-based acquisition functions.
- **Acquisition Function Comparison**: Compared the performance of 11 acquisition functions on different types of code tasks.
- **Experimental Environment**: Used two pre-trained code models (CodeBERT and GraphCodeBERT) and conducted experiments on multiple datasets.

### Experimental Results

- **Feature Selection**: The model output vector is the optimal feature in most cases, especially in non-classification tasks.
- **Acquisition Function Performance**: The clustering-based methods are superior to simple uncertainty methods in some tasks, but perform poorly in code summarization tasks.
- **Future Directions**: Proposed a new method based on evaluation-metric distance calculation and found a weak correlation between it and model performance.

In conclusion, by constructing a benchmark and conducting an empirical study, this paper reveals the potential and challenges of active learning for code models and provides a valuable reference for future research.
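
To make the two families of acquisition functions contrasted above more concrete, the following is a minimal sketch in Python with NumPy. It is not the paper's implementation, and this summary does not list the 11 acquisition functions the authors benchmark. The two functions here, `entropy_acquisition` (an uncertainty-based selector) and `kmeans_acquisition` (a clustering-based selector), are illustrative stand-ins with hypothetical names. Both operate on model output vectors, the feature type the summary reports as working best, and `budget` is an assumed name for the number of samples to send for labeling in one round.

```python
import numpy as np


def entropy_acquisition(probs: np.ndarray, budget: int) -> np.ndarray:
    """Uncertainty-based selection (illustrative, not the paper's code).

    probs:  (n_samples, n_classes) softmax outputs of the current model.
    Returns indices of the `budget` samples with the highest predictive
    entropy, most uncertain first.
    """
    eps = 1e-12  # avoids log(0) for fully confident predictions
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    # argsort is ascending, so reverse it before taking the top `budget`.
    return np.argsort(entropy)[::-1][:budget]


def kmeans_acquisition(features: np.ndarray, budget: int,
                       n_iter: int = 20, seed: int = 0) -> np.ndarray:
    """Clustering-based selection (illustrative, not the paper's code).

    features: (n_samples, dim) float array, e.g. model output vectors.
    Runs a plain k-means with `budget` clusters and returns the index of
    the sample closest to each centroid. Duplicates are removed, so fewer
    than `budget` indices may be returned.
    """
    rng = np.random.default_rng(seed)
    init = rng.choice(len(features), size=budget, replace=False)
    centroids = features[init].astype(float)  # fancy indexing copies
    for _ in range(n_iter):
        # Pairwise distances, shape (n_samples, budget).
        dists = np.linalg.norm(
            features[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for k in range(budget):
            members = features[assign == k]
            if len(members) > 0:  # keep the old centroid if cluster is empty
                centroids[k] = members.mean(axis=0)
    dists = np.linalg.norm(
        features[:, None, :] - centroids[None, :, :], axis=2)
    # For each centroid, pick the nearest real sample.
    return np.unique(dists.argmin(axis=0))
```

In one active learning round, a selector like these would be called on the unlabeled pool, the returned indices would be labeled and added to the training set, and the model would be retrained before the next round. The entropy selector only applies when the output vector is a class distribution, whereas the clustering selector accepts any fixed-length feature vector, which is one reason clustering over output vectors can also be tried on non-classification tasks.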