Abstract: Evaluating classification accuracy is a key component of the training and validation stages of thematic map production, and the choice of metric has profound implications for both the success of the training process and the reliability of the final accuracy assessment. We explore key considerations in selecting and interpreting loss and assessment metrics in the context of data imbalance, which arises when the classes have unequal proportions within the dataset or landscape being mapped. The challenges involved in calculating single, integrated measures that summarize classification success, especially for datasets with considerable data imbalance, have led to much confusion in the literature. This confusion arises from a range of issues, including a lack of clarity over the redundancy of some accuracy measures, the importance of calculating final accuracy from population-based statistics, the effects of class imbalance on accuracy statistics, and the differing roles of accuracy measures when used for training and final evaluation. In order to characterize classification success at the class level, users typically generate averages from the class-based measures. These averages are sometimes generated at the macro-level, by taking averages of the individual-class statistics, or at the micro-level, by aggregating values within a confusion matrix and then calculating the statistic. We show that the micro-averaged producer's accuracy (recall), user's accuracy (precision), and F1-score, as well as weighted macro-averaged statistics where the class prevalences are used as weights, are all equivalent to each other and to the overall accuracy, and thus are redundant and should be avoided. Our experiment, using a variety of loss metrics for training, suggests that the choice of loss metric is not as complex as it might appear to be, despite the range of choices available, which include cross-entropy (CE), weighted CE, and micro- and macro-Dice. The highest, or close to highest, accuracies in our experiments were obtained by using CE loss for models trained with balanced data, and for models trained with imbalanced data, the highest accuracies were obtained by using weighted CE loss. We recommend that, since weighted CE loss used with balanced training is equivalent to CE, weighted CE loss is a good all-round choice. Although Dice loss is commonly suggested as an alternative to CE loss when classes are imbalanced, micro-averaged Dice is similar to overall accuracy, and thus is particularly poor for training with imbalanced data. Furthermore, although macro-Dice resulted in models with high accuracy when the training used balanced data, when the training used imbalanced data, the accuracies were lower than for weighted CE. In summary, this paper provides readers with an overview of accuracy and loss metric terminology, insight regarding the redundancy of some measures, and guidance regarding best practices.
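The redundancy claim in the abstract can be checked numerically. The following is a minimal sketch, not the authors' code: the 3-class confusion matrix is hypothetical, and the row/column convention (rows = reference, columns = predicted) and all variable names are assumptions made for illustration. It computes overall accuracy, the micro-averaged recall, precision, and F1-score, and the macro-averaged recall weighted by reference-class prevalence, alongside the unweighted macro-averaged recall for contrast.

```python
import numpy as np

# Hypothetical imbalanced 3-class confusion matrix (illustrative values only).
# Assumed convention: rows = reference labels, columns = predicted labels.
cm = np.array([
    [90,  5,  5],
    [10, 20, 10],
    [ 2,  3,  5],
])

tp = np.diag(cm).astype(float)   # per-class true positives
fn = cm.sum(axis=1) - tp         # per-class false negatives (rest of each row)
fp = cm.sum(axis=0) - tp         # per-class false positives (rest of each column)
n = cm.sum()                     # total number of samples

overall_accuracy = tp.sum() / n

# Micro-averaging: aggregate the counts over classes, then compute the statistic.
micro_recall = tp.sum() / (tp.sum() + fn.sum())
micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

# Macro-averaging: compute the statistic per class, then average.
class_recall = tp / (tp + fn)
prevalence = cm.sum(axis=1) / n                       # reference-class proportions
weighted_macro_recall = (prevalence * class_recall).sum()
macro_recall = class_recall.mean()                    # unweighted average

print(f"overall accuracy      : {overall_accuracy:.4f}")
print(f"micro recall          : {micro_recall:.4f}")
print(f"micro precision       : {micro_precision:.4f}")
print(f"micro F1              : {micro_f1:.4f}")
print(f"weighted macro recall : {weighted_macro_recall:.4f}")
print(f"macro recall          : {macro_recall:.4f}")
```

In this sketch the first five values coincide because every off-diagonal cell counts once as a false negative (for its row) and once as a false positive (for its column), so the aggregated denominators both equal the total sample count. Only the unweighted macro-average differs, since it gives the rare classes the same influence as the common one.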

Empirical analysis of performance assessment for imbalanced classification

Measuring Class-Imbalance Sensitivity of Deterministic Performance Evaluation Metrics

Evaluating classifier performance with highly imbalanced Big Data

Imbalanced class distribution and performance evaluation metrics: A systematic review of prediction accuracy for determining model performance in healthcare systems

Robust performance metrics for imbalanced classification problems

Selecting and Interpreting Multiclass Loss and Accuracy Assessment Metrics for Classifications with Class Imbalance: Guidance and Best Practices

Cost-Sensitive Learning based on Performance Metric for Imbalanced Data

The Effect of Balancing Methods on Model Behavior in Imbalanced Classification Problems

Balancing the Scales: A Comprehensive Study on Tackling Class Imbalance in Binary Classification

Analysis of multi-class classification performance metrics for remote sensing imagery imbalanced datasets

Appropriateness of Performance Indices for Imbalanced Data Classification: An Analysis

An empirical evaluation of imbalanced data strategies from a practitioner's point of view

Iterative Metric Learning for Imbalance Data Classification

A study on cost behaviors of binary classification measures in class-imbalanced problems

Analysis and Comparison of Classification Metrics

Fair evaluation of classifier predictive performance based on binary confusion matrix

Classification performance assessment for imbalanced multiclass data

An Empirical Study on the Joint Impact of Feature Selection and Data Re-sampling on Imbalance Classification

Empirical Comparison of Area under ROC curve (AUC) and Mathew Correlation Coefficient (MCC) for Evaluating Machine Learning Algorithms on Imbalanced Datasets for Binary Classification

Empirical study of Machine Learning Classifier Evaluation Metrics behavior in Massively Imbalanced and Noisy data