Automatic Identification of Decisions from the Hibernate Developer Mailing List
Xueying Li, Peng Liang, Zengyang Li
DOI: https://doi.org/10.1145/3383219.3383225
2020-04-15
Abstract: Decisions pervade the entire software development and maintenance process. Explicitly documenting these decisions helps to organize development knowledge and to reduce its vaporization, thereby keeping the development process and maintenance costs under control. It also supports knowledge acquisition by project stakeholders. Meanwhile, developers (e.g., architects) and managers can rely on decisions made in the past to solve problems encountered in their current projects. However, identifying decisions in massive textual artifacts by hand requires considerable human effort, time, and cost, and is usually unaffordable given limited resources. To address this problem, we conducted an experiment to automatically identify decisions from textual artifacts using machine learning techniques. We created a dataset of 1,300 labelled sentences from the Hibernate developer mailing list, containing 650 decision sentences and 650 non-decision sentences, and trained machine learning models using 160 configurations of text preprocessing, feature extraction, and classification algorithms. The results show that (1) the text preprocessing method with Including Stop Words, No Stemming and Lemmatization, and No Filtering Out Sentences performs best when preprocessing posts to identify decisions; (2) the simple Bag-of-Words (BoW) model works best when extracting features to identify decisions; (3) the Support Vector Machine (SVM) algorithm achieves the best result when training classifiers to identify decisions; and (4) the SVM algorithm with Including Stop Words (ISW), No Stemming and Lemmatization (NSaL), Filtering Out Sentences by Length (FOSbL), and BoW achieves the best performance among all configurations (a precision of 0.640, a recall of 0.932, and an F1-score of 0.759) when identifying decisions from the mailing list.
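To make the best-performing configuration more concrete, the following is a minimal Python sketch of the preprocessing and feature-extraction steps named in the abstract (ISW, NSaL, FOSbL, and BoW), together with a check of the reported F1-score. It is an illustration under stated assumptions, not the authors' implementation: the tokenizer, the length threshold `min_tokens`, the helper names, and the example sentences are all hypothetical, and the SVM training step is left out.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase and split into word tokens.

    Stop words are kept (ISW) and no stemming or lemmatization
    is applied (NSaL). The regex itself is an assumption.
    """
    return re.findall(r"[a-z0-9']+", sentence.lower())


def filter_by_length(sentences, min_tokens=3):
    """Drop very short sentences (FOSbL).

    The threshold of 3 tokens is a hypothetical value chosen for
    illustration; the abstract does not state the one used.
    """
    return [s for s in sentences if len(tokenize(s)) >= min_tokens]


def build_vocabulary(sentences):
    """Map every distinct token to a column index."""
    tokens = sorted({tok for s in sentences for tok in tokenize(s)})
    return {tok: i for i, tok in enumerate(tokens)}


def bow_vector(sentence, vocab):
    """Bag-of-Words: one raw term count per vocabulary entry."""
    counts = Counter(tokenize(sentence))
    vector = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:  # tokens unseen during vocabulary building are ignored
            vector[vocab[tok]] = n
    return vector


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Made-up sentences in the style of a developer mailing list.
sentences = [
    "We will use the second-level cache for this.",
    "Thanks!",
    "I think we should drop the old dialect.",
]
kept = filter_by_length(sentences)
vocab = build_vocabulary(kept)
features = [bow_vector(s, vocab) for s in kept]

# Precision and recall as reported in the abstract.
reported_f1 = f1_score(0.640, 0.932)
print(round(reported_f1, 3))
```

The resulting count vectors are the kind of input an SVM classifier would be trained on. The final lines confirm that the reported precision and recall are consistent with the reported F1-score of 0.759.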