Boosting Commit Classification Based on Multivariate Mixed Features and Heterogeneous Classifier Selection
Yuhan Wu,Yingling Li,Ziao Wang,Qushan Tan,Jing Liu,Yuao Jiang
DOI: https://doi.org/10.1142/s021819402450044x
IF: 1.007
2024-10-02
International Journal of Software Engineering and Knowledge Engineering
Abstract:International Journal of Software Engineering and Knowledge Engineering, Ahead of Print. Commit classification plays a crucial role in software maintenance, as it permits developers to make informed decisions regarding resource allocation and code review. There are several approaches for automatic commit classification, yet they do not sufficiently explore the features related to commits and consider the advantages of ensemble models over individual models. Therefore, there is some room for improvement. In this paper, we propose MuheCC, a commit classification approach based on multivariate mixed features and heterogeneous classifier selection to address these challenges. It mainly consists of three phases: (1) Multivariate mixed feature extraction, which extracts features from commit messages, changed code and handcrafted features to construct comprehensive mixed features; (2) Hyperparameter tuning based on genetic algorithm, which utilizes genetic algorithm to optimize candidate traditional models and ensemble models; (3) Heterogeneous classifier selection, which selects the optimal combinations of traditional and ensemble models, respectively, to build a heterogeneous classifier for commit classification. To evaluate this approach, we extend an existing dataset with code changes for each commit and compare MuheCC with three baselines on this real-world dataset. The results show that MuheCC outperforms all baselines, especially improving the best baseline by 7.25% for accuracy, 6.88% for precision, 7.25% for recall and 7.06% for [math]-score. Furthermore, the ablation experiments validate that the performance advantage of MuheCC is mainly attributed to the multivariate features (e.g. 12.55% contributions to accuracy) and the heterogeneous classifier (e.g. 12.26% contributions to accuracy). We further discuss the impact of hyperparameter tuning and heterogeneous classifier selection on the performance of MuheCC. These results prove the superiority and potential practical value of MuheCC.
computer science, artificial intelligence,engineering, electrical & electronic, software engineering