Abstract: Traditional defect prediction approaches often use metrics that measure the complexity of the design or implementing code of a software system, such as the number of lines of code in a source file. In this paper, we explore a different approach based on content of source code. Our key assumption is that source code of a software system contains information about its technical aspects and those aspects might have different levels of defect-proneness. Thus, content-based features such as words, topics, data types, and package names extracted from a source code file could be used to predict its defects. We have performed an extensive empirical evaluation and found that: i) such content-based features have higher predictive power than code complexity metrics and ii) the use of feature selection, reduction, and combination further improves the prediction performance.

What problem does this paper attempt to address?

This paper addresses the limitations of traditional software defect prediction methods. Traditional methods usually rely on metrics that measure code complexity, such as the number of lines of code (LOC) in a source file, and the paper argues that these metrics alone are insufficient for predicting defects. The authors propose a new method based on features of the source code content, aiming to improve the accuracy and efficiency of defect prediction.

### Problem Background

Software defects occur frequently during development and are costly in both money and time. According to Gallaher and Kropp (2002), software defects cost the US economy nearly $60 billion every year. In addition, Hailpern and Santhanam (2002) found that finding and fixing defects accounts for 50-75% of the total development cost of a software project. Early detection of defects is therefore crucial for reducing development costs and improving development efficiency.

### Limitations of Traditional Methods

Traditional defect prediction methods mainly rely on code complexity metrics such as lines of code (LOC), depth of inheritance tree (DIT), number of children (NOC), lack of cohesion in methods (LCOM), and coupling between objects (CBO). However, these metrics do not capture what the code actually does, and so they miss differences in defect-proneness that stem from the code's function.

### Assumptions of the New Method

The authors make two key assumptions:

1. **Differences in Functional Modules**: A software system usually implements multiple functional modules, and each module may have a different level of defect-proneness. For example, in the jEdit editor, the code for the GUI and editing commands is highly error-prone, while the code for text search and parsing is relatively stable.
2. **Function Inference**: The function of a code module can be inferred from its content (identifiers, comments, annotations, string literals, keywords, embedded documents, etc.). Developers often choose names that imply function. For example, the class name `OptionsDialog` clearly indicates code related to an options dialog.

### Proposed Method

Based on these assumptions, the authors explore four new content-based features:

1. **Term Features**: Textual terms extracted from all text elements in the code, such as comments and identifiers. The terms are obtained through tokenization and stemming and are represented with the bag-of-words model.
2. **Topic Features**: Topics are generated from the extracted terms using the LDA (Latent Dirichlet Allocation) topic-modeling technique. Each topic represents a set of related words.
3. **Type Features**: The data types of variables and objects, together with the packages that contain these types. Compared with term features, types provide a higher level of abstraction.
4. **Package Features**: Package information derived from the type features, providing a higher level of abstraction than individual types.

### Experimental Verification

The authors conducted an extensive empirical evaluation on a public defect data set covering 42 versions of 14 real-world software systems. The results show that content-based features have higher predictive power than traditional code complexity metrics (such as LOC or the CK metrics), and that prediction accuracy can be improved further through feature selection and dimension-reduction techniques.

### Main Contributions

1. New defect-prediction features are proposed: term, topic, type, and package features.
2. An empirical evaluation compares the predictive performance of these new features with that of traditional code metrics.
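To make the term features concrete, the following is a minimal Python sketch of turning a source file into a bag of words. It is an illustration under stated assumptions, not the authors' implementation: the function names, the tiny stop-word list, and the `crude_stem` suffix stripper (a stand-in for a real stemmer such as Porter's) are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "for", "this"}


def split_identifier(token):
    """Split camelCase, PascalCase, and snake_case tokens into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", token)
    return [part.lower() for part in parts]


def crude_stem(word):
    """Strip a few common suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(source_text):
    """Count stemmed terms from identifiers, comments, and keywords alike."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source_text)
    terms = Counter()
    for token in tokens:
        for word in split_identifier(token):
            # Skip single letters (e.g. the "J" prefix of Swing classes).
            if len(word) > 1 and word not in STOP_WORDS:
                terms[crude_stem(word)] += 1
    return terms


SAMPLE = """
// Shows the options dialog
public class OptionsDialog extends JDialog {
    private JButton okButton;
}
"""

term_counts = bag_of_words(SAMPLE)
```

Each file then becomes a vector of term counts that a standard classifier can consume. In this sample, the comment and the class name `OptionsDialog` contribute the same terms, which is how naming conventions feed into the prediction.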
By predicting defects from content-based features, the authors aim to identify defect-prone code more accurately at an early stage, helping developers focus their defect detection and repair work where it is most needed.
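The relationship between type features and package features can be sketched in a similar way. The Python example below assumes that type resolution has already been done (for instance by a Java parser) and that each used type is available as a fully qualified name of a top-level class. The function names and the sample input are hypothetical and only illustrate how package counts can be aggregated from type counts.

```python
from collections import Counter


def package_of(qualified_type):
    """Return the package part of a fully qualified type name ('' if none)."""
    package, _, _ = qualified_type.rpartition(".")
    return package


def type_and_package_features(qualified_types):
    """Count type usages, then roll them up into per-package counts."""
    type_counts = Counter(qualified_types)
    package_counts = Counter()
    for name, count in type_counts.items():
        package = package_of(name)
        if package:  # types without a package contribute no package feature
            package_counts[package] += count
    return type_counts, package_counts


# Hypothetical list of types used in one source file.
used_types = [
    "javax.swing.JDialog",
    "javax.swing.JButton",
    "java.util.List",
    "javax.swing.JButton",
]

type_counts, package_counts = type_and_package_features(used_types)
```

Here the file yields three usages from `javax.swing` and one from `java.util`, so the package features summarize the same information as the type features at a coarser granularity.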

Defect Prediction with Content-based Features