Abstract: Understanding the strengths and weaknesses of machine learning (ML) algorithms is crucial for determining their scope of application. Here, we introduce the DIverse and GENerative ML Benchmark (DIGEN), a collection of synthetic datasets for comprehensive, reproducible, and interpretable benchmarking of machine learning algorithms for classification of binary outcomes. The DIGEN resource consists of 40 mathematical functions that map continuous features to discrete endpoints for creating synthetic datasets. These 40 functions were discovered using a heuristic algorithm designed to maximize the diversity of performance among multiple popular machine learning algorithms, thus providing a useful test suite for evaluating and comparing new methods. Access to the generative functions facilitates understanding of why a method performs poorly compared to other algorithms, thus providing ideas for improvement. The resource, with extensive documentation and analyses, is open-source and available on GitHub.

What problem does this paper attempt to address?

This paper addresses the difficulty of evaluating and comparing the performance of machine-learning classification algorithms. Specifically, it aims to provide a comprehensive, reproducible, and interpretable benchmarking resource that reveals the strengths and weaknesses of different classification algorithms on binary outcomes.

### Main problems:

1. **Limitations of existing benchmark datasets**: Existing real-world and simulated datasets have limitations: the true underlying pattern in the data is hard to establish, the datasets lack diversity, and they often fail to separate the performance of different algorithms.
2. **Need for diverse and interpretable benchmarks**: Evaluating machine-learning algorithms comprehensively requires datasets on which algorithms perform very differently from one another, and those datasets should have known generative functions so that it is possible to understand why an algorithm performs poorly in a given case.
3. **Promote algorithm improvement**: Detailed performance analysis and interpretable results help researchers understand the weaknesses of algorithms and suggest improvements.

### Solutions:

The paper introduces the DIverse and GENerative ML Benchmark (DIGEN), a collection of synthetic datasets generated by 40 mathematical functions for evaluating and comparing the performance of machine-learning classification algorithms. Each dataset has a known generative function, so users can generate any number of replicate datasets and extend the benchmark by adjusting the sample size and feature distribution.

### Main features of DIGEN:

- **Diversity**: The datasets are selected to maximize the performance differences among multiple popular machine-learning algorithms.
- **Interpretability**: The generative functions are provided, which helps explain why a given algorithm performs poorly on a specific dataset.
- **Reproducibility**: All datasets and analysis results are reproducible, and Docker containers are provided to ensure cross-platform consistency.
- **Scalability**: Users can generate any number of datasets as needed and adjust the sample size and feature distribution.
- **Open-source**: All code and datasets are open-source, which facilitates extension and verification.

With these features, DIGEN gives the machine-learning community a tool not only for evaluating and comparing classification algorithms, but also for understanding their advantages and disadvantages in depth, which in turn supports further algorithmic improvement.
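The idea of a known generative function can be illustrated with a minimal sketch. This is not DIGEN's API or one of its 40 functions: `hypothetical_generator`, `make_dataset`, the Gaussian feature distribution, the default sizes, and the median threshold used to obtain a binary endpoint are all assumptions made for illustration. The sketch only shows how a fixed function plus a seed yields a dataset that can be regenerated exactly, with adjustable sample size and number of features.

```python
import random
import statistics


def hypothetical_generator(x):
    """Illustrative function of continuous features (NOT one of DIGEN's 40).

    Uses only the first three features, so rows need at least 3 values.
    """
    return x[0] * x[1] - x[2] ** 2


def make_dataset(func, n_samples=1000, n_features=10, seed=0):
    """Build a synthetic binary-classification dataset from a known function.

    Assumptions: features are standard-normal draws, and the continuous
    output of `func` is thresholded at its median to give a binary label.
    """
    rng = random.Random(seed)  # local RNG so the same seed gives the same data
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    scores = [func(row) for row in X]
    cutoff = statistics.median(scores)
    y = [1 if s > cutoff else 0 for s in scores]
    return X, y


# Same function and seed: identical replicate. Different seed: a new replicate.
X_a, y_a = make_dataset(hypothetical_generator, seed=42)
X_b, y_b = make_dataset(hypothetical_generator, seed=42)
X_c, y_c = make_dataset(hypothetical_generator, seed=43)
print(X_a == X_b and y_a == y_b)  # replicates with the same seed match
print(X_a == X_c)                 # a different seed gives different features
```

Because the labeling rule is explicit, a classifier's failure on such a dataset can be traced back to the structure of the function (here, an interaction term and a quadratic term) rather than guessed at, which is the interpretability argument made above.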

Generative and reproducible benchmarks for comprehensive evaluation of machine learning classifiers
