Automatic Identification of Code Smell Discussions on Stack Overflow: A Preliminary Investigation.

Sergei Shcherban,Peng Liang,Amjed Tahir,Xueying Li
DOI: https://doi.org/10.1145/3382494.3422161
2020-01-01
Abstract:Background: Code smells indicate potential design or implementation problems that may have a negative impact on programs. Similar to other software artefacts, developers use Stack Overflow (SO) to ask questions about code smells. However, given the high number of questions asked on the platform, and the limitations of the default tagging system, it takes significant effort to extract knowledge about code smells by means of manual approaches. Aim: We utilized supervised machine learning techniques to automatically identify code-smell discussions from SO posts. Method: We conducted an experiment using a manually labeled dataset that contains 3000 code-smell and 3000 non-code-smell posts to evaluate the performance of different classifiers when automatically identifying code smell discussions. Results: Our results show that Logistic Regression (LR) with parameter C=20 (inverse of regularization strength) and Bag of Words (BoW) feature extraction technique achieved the best performance amongst the algorithms we evaluated with a precision of 0.978, a recall of 0.965, and an F1-score of 0.971. Conclusion: Our results show that machine learning approach can effectively locate code-smell posts even if posts' title and/or tags cannot be of help. The technique can be used to extract code smell discussions from other textual artefacts (e.g., code reviews), and promisingly to extract SO discussions of other topics.
What problem does this paper attempt to address?