Abstract: Machine Learning (ML) and Deep Learning (DL) innovations are being introduced at such a rapid pace that model owners and evaluators are hard-pressed to analyze and study them. This is exacerbated by the complicated procedures for evaluation. The lack of standard systems and efficient techniques for specifying and provisioning ML/DL evaluation is the main cause of this "pain point". This work discusses common pitfalls for replicating DL model evaluation, and shows that these subtle pitfalls can affect both accuracy and performance. It then proposes a solution to remedy these pitfalls called MLModelScope, a specification for repeatable model evaluation and a runtime to provision and measure experiments. We show that by easing the model specification and evaluation process, MLModelScope facilitates rapid adoption of ML/DL innovations.


### What problems does this paper attempt to solve?

This paper aims to solve the complexity and reproducibility problems encountered when evaluating machine learning (ML) and deep learning (DL) models. Specifically, it identifies the following problems with current ML/DL model sharing and evaluation methods:

1. **Complexity of the hardware and software stack**:
   - ML/DL model evaluation depends on multiple hardware and software abstraction layers (application pipelines, model definitions, framework execution, library calls, and hardware instruction execution). These layers must work together to maintain the reported accuracy and performance.
   - Setting up and configuring these stacks is complex and usually requires detailed documentation, but the existing documentation is often insufficient.
2. **Common pitfalls when replicating model claims**:
   - **Pre-processing/post-processing**: If the pre-processing of input data or the post-processing of output data is not handled exactly as the model owner did, subtle errors creep in and results become inconsistent.
   - **Software stack**: Different versions of frameworks and libraries (such as TensorFlow, PyTorch, and MKL-DNN) affect the accuracy and performance of the evaluation.
   - **Hardware configuration**: Different hardware configurations (such as CPU extensions, multithreading, and vectorization) also affect performance and accuracy.
   - **Programming language selection**: Different programming languages (such as Python and C/C++) and their numerical representations can significantly affect performance.
3. **Lack of a standardized evaluation specification**:
   - Although existing model sharing methods facilitate academic exchange, it is still very difficult for ordinary users to understand and reproduce these models. Even experts must invest considerable effort to reproduce the results of others.
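The pre-processing pitfall listed above can be made concrete with a small sketch. It assumes a normalization step that subtracts a mean and divides by a scale, and compares doing the subtraction after converting to float against doing it while the pixels are still unsigned 8-bit values. The constants and function names are hypothetical and chosen only for illustration; they are not taken from the paper or from MLModelScope.

```python
# Hypothetical illustration of an order-of-operations pitfall in
# pre-processing. All names and constants are assumptions.

MEAN_BYTE = 128   # assumed per-channel mean, in 0..255 units
SCALE = 128.0     # assumed divisor applied after mean subtraction


def normalize_as_float(pixels):
    """Convert each byte to float first, then subtract the mean and scale."""
    return [(float(p) - MEAN_BYTE) / SCALE for p in pixels]


def normalize_as_uint8(pixels):
    """Subtract the mean while still in unsigned 8-bit arithmetic.

    The modulo emulates uint8 wrap-around for values below the mean;
    only afterwards is the result converted to float and scaled.
    """
    return [float((p - MEAN_BYTE) % 256) / SCALE for p in pixels]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two lists."""
    return max(abs(x - y) for x, y in zip(a, b))


pixels = [0, 64, 128, 200, 255]
as_float = normalize_as_float(pixels)
as_uint8 = normalize_as_uint8(pixels)

# Pixels at or above the mean agree; pixels below it wrap around.
print(as_float)
print(as_uint8)
print(max_abs_diff(as_float, as_uint8))
```

Both functions look like reasonable readings of "subtract the mean and scale", yet they disagree for every pixel below the mean, which is the kind of silent divergence that undocumented pre-processing can introduce.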
To solve these problems, the paper proposes a solution named **MLModelScope**. MLModelScope simplifies and standardizes the model evaluation process in the following ways:

- **Model Manifest**: Provides a text-format specification for standardizing model sharing and avoiding the identified pitfalls. Model owners can easily share their models without having to write complete documentation.
- **Runtime System**: Uses the model manifest as input to simplify the model evaluation process and make it accessible to both ordinary users and experts.
- **Extensible data collection and analysis pipeline**: Helps simplify the process of model understanding, analysis, and comparison.

Through these improvements, MLModelScope aims to accelerate the application and promotion of ML/DL innovation, enabling more people to easily evaluate and use these models.
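To give a feel for what a text-format model manifest might capture, here is a minimal sketch written as a Python dictionary with two small checks a runtime could perform before evaluation. Every field name, value, and helper below is a hypothetical assumption for illustration; the actual MLModelScope manifest schema is not described in this summary and may differ.

```python
# Hypothetical manifest layout: every field name here is an assumption
# made for illustration, not the actual MLModelScope manifest schema.
EXAMPLE_MANIFEST = {
    "name": "example-classifier",
    "version": "1.0",
    # Pin the software stack so evaluators use the same framework release.
    "framework": {"name": "ExampleFramework", "version": "1.12"},
    # Spell out pre-processing so it is not left to guesswork.
    "preprocess": [
        {"op": "decode", "color_layout": "RGB"},
        {"op": "resize", "size": [224, 224]},
        {"op": "normalize", "mean": [128, 128, 128], "scale": 128.0},
    ],
    "postprocess": [{"op": "top_k", "k": 5}],
}

REQUIRED_FIELDS = ("name", "version", "framework", "preprocess", "postprocess")


def missing_fields(manifest):
    """Return the required top-level fields that the manifest lacks."""
    return [field for field in REQUIRED_FIELDS if field not in manifest]


def framework_matches(manifest, installed_name, installed_version):
    """True when the installed framework has the pinned name and the same
    major.minor version as the manifest."""
    pinned = manifest["framework"]
    same_name = pinned["name"] == installed_name
    same_release = (
        installed_version.split(".")[:2] == pinned["version"].split(".")[:2]
    )
    return same_name and same_release


print(missing_fields(EXAMPLE_MANIFEST))
print(framework_matches(EXAMPLE_MANIFEST, "ExampleFramework", "1.12.3"))
```

The design idea is that the pitfalls listed earlier (pre-processing, software stack) become explicit, machine-checkable fields rather than prose documentation that an evaluator has to interpret.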

Frustrated with Replicating Claims of a Shared Model? A Solution
