Multiclass Classification for Self-Admitted Technical Debt Based on XGBoost
Xin Chen, Dongjin Yu, Xulin Fan, Lin Wang, Jie Chen
DOI: https://doi.org/10.1109/tr.2021.3087864
IF: 5.883
2021-01-01
IEEE Transactions on Reliability
Abstract: In software development, user demands and limits on time and resources often lead developers to adopt suboptimal solutions in order to deliver software quickly. The released software then contains not-quite-right code, called technical debt, which significantly decreases software quality and increases maintenance cost. Recently, the concept of self-admitted technical debt (SATD) was proposed; it refers to technical debt that developers themselves admit in code comments. Existing studies mainly focus on detecting technical debt by classifying code comments as either SATD or non-SATD. However, different types of SATD have different impacts on software maintenance and need to be handled by different developers. Therefore, detected SATD should be further classified so that developers can better understand and remove technical debt. In this article, we propose a new method based on eXtreme Gradient Boosting (XGBoost) to classify SATD into multiple classes. In our approach, we first preprocess the original code comments and adopt the easy data augmentation strategy to overcome the class imbalance problem. Then, chi-square is leveraged to select representative features from the textual feature set. Finally, we apply XGBoost to train a classifier and use the trained classifier to assign each comment to its corresponding class. We experimentally investigate the effectiveness of our approach on a public dataset of 62 566 code comments from 10 open-source projects. Experimental results show that, on average, our approach achieves 56.66 in macroaveraged precision, 59.07 in macroaveraged recall, and 55.77 in macroaveraged F-measure, and outperforms the natural language processing based method by 4.98, 5.32, and 3.17, respectively. In addition, the experimental results demonstrate that the data augmentation strategy is effective in improving the performance of our approach.
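The macroaveraged metrics reported above weight every SATD class equally, regardless of how many comments it contains. The following is a minimal sketch of how such scores could be computed, not the authors' evaluation code. It assumes that the macroaveraged F-measure is the unweighted mean of per-class F-measures, which the abstract does not state explicitly. The function name `macro_scores`, the class names, and the labels are hypothetical and invented for illustration.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F-measure).

    Each class contributes equally to the average, so rare classes
    count as much as frequent ones.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f_measures = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f_measures.append(f)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f_measures) / n


# Hypothetical labels for six comments (illustration only, not from the dataset).
y_true = ["design", "design", "defect", "test", "test", "test"]
y_pred = ["design", "defect", "defect", "test", "test", "design"]

macro_p, macro_r, macro_f = macro_scores(y_true, y_pred)
print(f"precision={macro_p:.4f} recall={macro_r:.4f} f_measure={macro_f:.4f}")
```

Under this assumption the macroaveraged F-measure can be lower than both the macroaveraged precision and the macroaveraged recall, which is consistent with the reported 55.77 against 56.66 and 59.07.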
engineering, electrical & electronic,computer science, software engineering, hardware & architecture