Generative and reproducible benchmarks for comprehensive evaluation of machine learning classifiers

Patryk Orzechowski, Jason H. Moore
DOI: https://doi.org/10.48550/arXiv.2107.06475
2021-07-14
Abstract: Understanding the strengths and weaknesses of machine learning (ML) algorithms is crucial for determining their scope of application. Here, we introduce the DIverse and GENerative ML Benchmark (DIGEN), a collection of synthetic datasets for comprehensive, reproducible, and interpretable benchmarking of machine learning algorithms for classification of binary outcomes. The DIGEN resource consists of 40 mathematical functions that map continuous features to discrete endpoints for creating synthetic datasets. These 40 functions were discovered using a heuristic algorithm designed to maximize the diversity of performance among multiple popular machine learning algorithms, thus providing a useful test suite for evaluating and comparing new methods. Access to the generative functions facilitates understanding of why a method performs poorly compared to other algorithms, thus providing ideas for improvement. The resource, with extensive documentation and analyses, is open-source and available on GitHub.
Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition, Neural and Evolutionary Computing
What problem does this paper attempt to address?
The paper addresses the difficulty of evaluating and comparing the performance of machine learning classification algorithms. Specifically, the authors aim to build a comprehensive, reproducible, and interpretable benchmarking resource that reveals the strengths and weaknesses of different classifiers on binary outcomes.

### Main problems:

1. **Limitations of existing benchmark datasets**: Existing real-world and simulated datasets have limitations: the true underlying pattern in the data is hard to establish, diversity is lacking, and the datasets do not fully distinguish the performance of different algorithms.
2. **Need for diverse and interpretable benchmarks**: Comprehensive evaluation requires datasets on which algorithms perform differently, and each dataset should have a known generative function so that users can understand why an algorithm performs poorly in a given case.
3. **Promoting algorithm improvement**: Detailed performance analysis and interpretable results help researchers understand the weaknesses of algorithms and suggest improvements.

### Solutions:

The authors introduce the DIverse and GENerative ML Benchmark (DIGEN), a collection of synthetic datasets generated by 40 mathematical functions for evaluating and comparing machine learning classification algorithms. Each dataset has a known generative function, so users can generate any number of replicate datasets and extend the benchmark by adjusting the sample size and feature distribution.

### Main features of DIGEN:

- **Diversity**: The datasets are selected to maximize the performance differences among multiple popular machine learning algorithms.
- **Interpretability**: The generative functions are provided, which helps explain why a given algorithm performs poorly on a specific dataset.
- **Reproducibility**: All datasets and analysis results are reproducible, and Docker containers are provided to ensure cross-platform consistency.
- **Scalability**: Users can generate any number of datasets as needed and adjust the sample size and feature distribution.
- **Open-source**: All code and datasets are open-source, which facilitates extension and verification.

With these features, DIGEN gives the machine learning community a tool not only for evaluating and comparing classification algorithms, but also for understanding their advantages and disadvantages in depth, which in turn supports further improvement of the algorithms.
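The idea of a dataset with a known generative function can be illustrated with a minimal sketch. Everything named here is hypothetical: `hypothetical_generative_function` is a stand-in and not one of the 40 DIGEN functions, and `make_dataset`, its default sizes, and the median threshold used to binarize the output are assumptions made for illustration rather than the DIGEN API. The sketch only shows how a fixed function plus a seed yields replicate datasets, and how the sample size and feature distribution can be varied.

```python
import math
import random
from statistics import median


def hypothetical_generative_function(row):
    """Illustrative stand-in only; NOT one of the 40 DIGEN functions.

    Maps a row of continuous features to a single real value.
    Only the first three features are used; the rest act as noise.
    """
    x0, x1, x2 = row[0], row[1], row[2]
    return math.sin(x0 * x1) + x2 ** 2 - abs(x0)


def standard_normal(rng):
    """Default feature sampler (an assumption): draws from N(0, 1)."""
    return rng.gauss(0.0, 1.0)


def make_dataset(n_samples=1000, n_features=10, seed=0, sampler=standard_normal):
    """Generate a synthetic binary classification dataset.

    The same seed always yields the same dataset, and changing
    n_samples or sampler changes the sample size or the feature
    distribution while the generative function stays fixed.
    """
    if n_features < 3:
        raise ValueError("the illustrative function needs at least 3 features")
    rng = random.Random(seed)
    X = [[sampler(rng) for _ in range(n_features)] for _ in range(n_samples)]
    raw = [hypothetical_generative_function(row) for row in X]
    # Binarize at the median so the two classes are roughly balanced
    # (an assumed choice for this sketch).
    threshold = median(raw)
    y = [1 if value > threshold else 0 for value in raw]
    return X, y


if __name__ == "__main__":
    X, y = make_dataset(seed=42)
    X_again, y_again = make_dataset(seed=42)
    print("replicate is identical:", X == X_again and y == y_again)
    print("positive class count:", sum(y), "of", len(y))
```

Because the function that produces the labels is known, a classifier that scores poorly on such a dataset can be examined against the exact pattern it failed to learn, here an interaction between the first two features combined with a quadratic term in the third.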