A new and better method of detecting duplicate defect reports using n-gram method

Ning Li, Zhanhuai Li, Lijun Zhang
DOI: https://doi.org/10.3969/j.issn.1000-2758.2010.02.028
2010-01-01
Xibei Gongye Daxue Xuebao/Journal of Northwestern Polytechnical University
Abstract: The introduction of the full paper points out what we believe to be the shortcomings of existing methods in the open literature [2,3]; we therefore propose a new and better method. Subsection 1.2 briefly introduces the N-gram model. Section 2 explains our method of detecting duplicate defect reports using the N-gram model. Subsections 2.1, 2.2, 2.3, 2.4, 2.5 and 2.7 are titled, respectively, tokenization, word stemming, synonym replacement, stop word removal, N-gram similarity calculation, and duplicate defect report detection accuracy measurement; in particular, Formula (6) in subsection 2.7 is central to calculating the recall rate of our method. In Section 3, we use a small subset of the Firefox defect repository to select the N parameter, the complete-sentence syntax and the summary information of software defect reports, and then evaluate our method on a large subset of the Firefox defect repository containing 4503 defect reports. The experimental results, presented in Figs. 2 and 3, show preliminarily that the recall rate of our method is 25% to 55% higher than that of the traditional Vector Space Model method in detecting duplicate defect reports.
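The steps listed in Section 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the similarity formula or Formula (6), so the sketch assumes character-level N-grams compared with a Dice coefficient, and a recall rate defined as the fraction of duplicate reports whose true master report appears in the suggested candidate list. The stop word list, synonym table, suffix-stripping stemmer and all function names are hypothetical placeholders.

```python
import re

# Hypothetical, tiny resources for illustration only; not taken from the paper.
STOP_WORDS = {"the", "a", "an", "is", "in", "on", "when", "to", "of"}
SYNONYMS = {"crash": "fail", "hang": "freeze"}


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, stem, replace synonyms, then remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [stem(t) for t in tokens]
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]


def char_ngrams(tokens, n):
    """Set of character n-grams over the space-joined tokens (an assumption)."""
    text = " ".join(tokens)
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient over n-gram sets; assumed, not the paper's formula."""
    ga = char_ngrams(preprocess(a), n)
    gb = char_ngrams(preprocess(b), n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def recall_rate(suggestions, true_master):
    """Share of duplicates whose true master is among the suggested candidates."""
    if not true_master:
        return 0.0
    hits = sum(
        1 for report_id, master in true_master.items()
        if master in suggestions.get(report_id, [])
    )
    return hits / len(true_master)
```

In such a setup, each incoming report would be scored against existing reports with `ngram_similarity`, the highest-scoring candidates would form its suggestion list, and `recall_rate` would summarize how often the known master report is among them. The value of `n` is the parameter that Section 3 describes tuning on a small subset of the repository.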