Evaluation metrics and statistical tests for machine learning

Oona Rainio, Jarmo Teuho, Riku Klén
DOI: https://doi.org/10.1038/s41598-024-56706-x
IF: 4.6
2024-03-14
Scientific Reports
Abstract: Research on different machine learning (ML) has become incredibly popular during the past few decades. However, for some researchers not familiar with statistics, it might be difficult to understand how to evaluate the performance of ML models and compare them with each other. Here, we introduce the most common evaluation metrics used for the typical supervised ML tasks including binary, multi-class, and multi-label classification, regression, image segmentation, object detection, and information retrieval. We explain how to choose a suitable statistical test for comparing models, how to obtain enough values of the metric for testing, and how to perform the test and interpret its results. We also present a few practical examples about comparing convolutional neural networks used to classify X-rays with different lung infections and detect cancer tumors in positron emission tomography images.
multidisciplinary sciences
What problem does this paper attempt to address?
The main objective of this paper is to introduce and explain the evaluation metrics and statistical tests commonly used in machine learning (ML). Specifically, it addresses the following issues:

1. **Selection of Evaluation Metrics**: How to choose appropriate evaluation metrics for different supervised learning tasks (binary, multi-class, and multi-label classification, regression, image segmentation, object detection, and information retrieval).
2. **Application of Statistical Tests**: How to select a suitable statistical test for comparing the performance of different models, how to obtain a sufficient number of metric values for testing, and how to perform the test and interpret its results.
3. **Practical Case Analysis**: How to compare the performance of different models in practice, demonstrated through case studies of convolutional neural networks (CNNs) that classify X-rays with different lung infections and detect cancer tumors in positron emission tomography (PET) images.

Overall, the paper aims to help researchers who are not familiar with basic statistical concepts understand and apply these evaluation metrics and statistical tests in their own ML research.
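
To make the workflow summarized above concrete, here is a minimal Python sketch of its two halves: computing a few binary classification metrics from predictions, and then comparing two models on paired metric values (for example, one value per cross-validation fold). The function names `binary_metrics` and `paired_sign_flip_test` are hypothetical, and the exact paired sign-flip permutation test is chosen here only because it needs nothing beyond the standard library. It is an assumption of this sketch, not necessarily the test the paper recommends for any given setting.

```python
from itertools import product


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided p-value for paired metric values of two models.

    Under the null hypothesis that the models perform equally, each
    paired difference is equally likely to be positive or negative, so
    every sign assignment is enumerated (2**n of them; keep n small).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance so floating-point ties count as "as extreme".
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total


if __name__ == "__main__":
    # Illustrative, made-up labels and per-fold scores (not from the paper).
    print(binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
    print(paired_sign_flip_test([0.91, 0.88, 0.93, 0.90, 0.92],
                                [0.89, 0.87, 0.90, 0.91, 0.88]))
```

A small p-value would suggest that the difference between the two models is unlikely to be due to chance alone. With only a handful of paired values the smallest attainable p-value is limited (2 / 2^n for n pairs), which is one reason obtaining enough metric values matters before testing.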