Abstract: Stack Overflow is a popular online question-and-answer site where software developers share their experience and expertise. Among the numerous questions posted on Stack Overflow, two or more may express the same point and are thus duplicates of one another. Duplicate questions make site maintenance harder, waste resources that could have been used to answer other questions, and cause developers to wait unnecessarily for answers that are already available. To reduce the problem, Stack Overflow allows questions to be manually marked as duplicates of others. Since thousands of questions are submitted to Stack Overflow every day, manually identifying duplicates is difficult. Thus, there is a need for an automated approach that helps detect duplicate questions. To address this need, we propose an automated approach named DupPredictor that takes a new question as input and detects potential duplicates of it by considering multiple factors. DupPredictor extracts the title and description of a question, as well as the tags attached to it. These fields (title, description, and a few tags) are mandatory: a user must provide them when posting a question. DupPredictor then computes the latent topics of each question using a topic model. Next, for each pair of questions, it computes four similarity scores by comparing their titles, descriptions, latent topics, and tags. These four scores are finally combined into a single similarity score that takes all of the factors into account. To examine the benefit of DupPredictor, we perform an experiment on a Stack Overflow dataset that contains more than two million questions. The results show that DupPredictor achieves a recall-rate@20 score of 63.8%.
We compare our approach with the standard search engine of Stack Overflow, and DupPredictor improves its recall-rate@10 score by 40.63%. We also compare our approach with four variants that each use only one of title, description, topic, or tag similarity, and with Runeson et al.’s approach, which has been used to detect duplicate bug reports; DupPredictor improves their recall-rate@10 scores by 27.2%, 97.4%, 746.0%, 231.1%, and 16.4%, respectively.
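
The pipeline described in the abstract (four component similarities combined into one score, evaluated with recall-rate@k) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state how each similarity is computed or how the scores are combined, so the sketch assumes cosine similarity over simple term counts, topic distributions precomputed by some topic model, and a weighted sum with equal weights. The names `Question`, `component_scores`, `combined_score`, `top_k`, and `recall_rate_at_k` are hypothetical. Recall-rate@k is taken here in its usual sense: the fraction of duplicate questions whose original appears among the top k candidates.

```python
from __future__ import annotations

import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Question:
    qid: int
    title: str
    description: str
    tags: list[str]
    topics: list[float]  # assumed: topic distribution precomputed by a topic model


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors stored as key -> weight."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bag(text: str) -> Counter:
    """Naive bag of words (assumption: lowercase, whitespace tokenization)."""
    return Counter(text.lower().split())


def component_scores(q1: Question, q2: Question) -> tuple:
    """Title, description, topic, and tag similarity for one pair of questions."""
    return (
        cosine(bag(q1.title), bag(q2.title)),
        cosine(bag(q1.description), bag(q2.description)),
        cosine(dict(enumerate(q1.topics)), dict(enumerate(q2.topics))),
        cosine(Counter(q1.tags), Counter(q2.tags)),
    )


def combined_score(q1, q2, weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted sum of the four scores (assumption: equal weights, not tuned)."""
    return sum(w * s for w, s in zip(weights, component_scores(q1, q2)))


def top_k(new_q, candidates, k, weights=(0.25, 0.25, 0.25, 0.25)) -> list:
    """Ids of the k candidates most similar to the new question."""
    ranked = sorted(candidates, key=lambda c: combined_score(new_q, c, weights),
                    reverse=True)
    return [c.qid for c in ranked[:k]]


def recall_rate_at_k(ranked_ids: dict, masters: dict, k: int) -> float:
    """Fraction of duplicates whose original is among their top-k candidates.

    ranked_ids: duplicate id -> ranked list of candidate ids
    masters:    duplicate id -> id of the question it duplicates
    """
    hits = sum(1 for dup, master in masters.items() if master in ranked_ids[dup][:k])
    return hits / len(masters)


if __name__ == "__main__":
    new_q = Question(1, "How to reverse a list in Python",
                     "I want to reverse a list", ["python", "list"], [0.9, 0.1])
    dup = Question(2, "Reverse a list in Python",
                   "How do I reverse a list", ["python", "list"], [0.8, 0.2])
    other = Question(3, "Segfault in C pointer arithmetic",
                     "My program crashes", ["c", "pointers"], [0.1, 0.9])
    print(top_k(new_q, [other, dup], k=1))
```

In this sketch the weights are fixed by hand; in practice they would be fitted on questions already marked as duplicates, and the naive tokenization would be replaced by proper text preprocessing.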

Multi-Factor Duplicate Question Detection in Stack Overflow