Abstract:

Context: Software development effort estimation (SDEE) is one of the most challenging aspects of project management, and missing data (MD) in software attributes makes it even more complex. K‐nearest neighbors imputation (KNNI) has been widely used in SDEE to deal with MD. However, classical KNNI has low tolerance to imprecision and uncertainty, especially for categorical features: it represents software attributes mainly as numbers or classical intervals and relies on similarity measures originally designed for numerical attributes.

Objectives: This paper evaluates an optimized fuzzy clustering‐based KNNI (FC‐KNNI) and compares it with classical KNNI on mixed data in the context of SDEE.

Methods: We investigate the effect of two imputation techniques (FC‐KNNI and KNNI) on five SDEE techniques: case‐based reasoning (CBR), fuzzy case‐based reasoning (F‐CBR), support vector regression, multilayer perceptron, and reduced‐error pruning tree. The evaluation uses six publicly available SDEE datasets and two performance measures, standardized accuracy (SA) and Pred(0.25). The Wilcoxon statistical test is used to assess the significance of the results.

Results: The results are promising in the sense that an imputation technique designed for mixed data performs better than methods originally designed for numerical data. FC‐KNNI significantly outperforms KNNI regardless of the SDEE technique and dataset used. Another important finding is that F‐CBR improved the analogy process compared with CBR.

Conclusion: Introducing fuzzy sets and fuzzy clustering into the analogy process improves its performance in terms of SA and Pred(0.25).
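
The abstract names the two performance measures without defining them. As a rough illustration, the sketch below computes Pred(0.25) and SA in plain Python. It assumes the definitions commonly used in the SDEE literature: Pred(0.25) is the share of projects whose magnitude of relative error is at most 0.25, and SA is one minus the ratio of the model's mean absolute error to that of random guessing. The function names, the exact pairwise form of the random-guessing baseline, and the sample effort values are illustrative assumptions and are not taken from the paper.

```python
def pred(actual, predicted, level=0.25):
    """Fraction of projects whose magnitude of relative error is <= level.

    Assumes every actual effort is strictly positive.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equally long")
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) / a <= level)
    return hits / len(actual)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted effort."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def random_guess_mae(actual):
    """Expected MAE when each project is 'predicted' by the actual effort
    of another project drawn uniformly at random (assumed baseline)."""
    n = len(actual)
    if n < 2:
        raise ValueError("need at least two projects for the baseline")
    total = sum(
        abs(a - b)
        for i, a in enumerate(actual)
        for j, b in enumerate(actual)
        if i != j
    )
    return total / (n * (n - 1))


def standardized_accuracy(actual, predicted):
    """SA = 1 - MAE / MAE of random guessing (assumed definition)."""
    return 1.0 - mean_absolute_error(actual, predicted) / random_guess_mae(actual)


# Invented effort values, for illustration only.
actual = [100, 200, 400, 800]
predicted = [110, 260, 390, 500]

print(pred(actual, predicted))                             # 0.5
print(round(standardized_accuracy(actual, predicted), 3))  # 0.752
```

Under these assumptions, an SA near 0 means the estimator does no better than random guessing, and higher values of both measures indicate better estimates. A comparison such as FC‐KNNI versus KNNI would compute both measures on the estimates obtained after each imputation technique.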

A novel ranked k-nearest neighbors algorithm for missing data imputation

Missing Data Imputation for Classification Problems

Numerical Data Imputation for Multimodal Data Sets: A Probabilistic Nearest-Neighbor Kernel Density Approach

An approach to dealing with missing values in heterogeneous data using k-nearest neighbors

An improved K-Nearest neighbour with grasshopper optimization algorithm for imputation of missing data

Integrated ECOD-KNN Algorithm for Missing Values Imputation in Datasets: Outlier Removal

Addressing Missing Data in a Healthcare Dataset Using an Improved kNN Algorithm

Missing data imputation by K nearest neighbours based on grey relational structure and mutual information

Usage of Clustering and Weighted Nearest Neighbors for Efficient Missing Data Imputation of Microarray Gene Expression Dataset

Temporal and Spatial Nearest Neighbor Values Based Missing Data Imputation in Wireless Sensor Networks

APT-KNN: AN EFFICIENT MISSING VALUE IMPUTATION METHOD ORIENTED TOWARD CLASSIFICATION ISSUE

Optimized fuzzy clustering‐based k‐nearest neighbors imputation for mixed missing data in software development effort estimation

A Novel Fuzzy Rough Clustering Parameter-based missing value imputation

Exploiting nearest neighbor data and fuzzy membership function to address missing values in classification

An Empirical Study of Dynamic Incomplete-Case Nearest Neighbor Imputation in Software Quality Data.

Missing data imputation using correlation coefficient and min-max normalization weighting

CHOOSING APPROPRIATE IMPUTATION METHODS FOR MISSING DATA: A DECISION ALGORITHM ON METHODS FOR MISSING DATA

Hybrid Missing Value Imputation Algorithm- KLR

A Probabilistic Approach for Missing Data Imputation

Performance Comparison of Hot-Deck Imputation, K-Nearest Neighbor Imputation, and Predictive Mean Matching in Missing Value Handling, Case Study: March 2019 SUSENAS Kor Dataset

An Intelligent Missing Data Imputation Techniques: A Review