Frustrated with Replicating Claims of a Shared Model? A Solution

Abdul Dakkak, Cheng Li, Jinjun Xiong, Wen-Mei Hwu
DOI: https://doi.org/10.48550/arXiv.1811.09737
2019-06-26
Abstract: Machine Learning (ML) and Deep Learning (DL) innovations are being introduced at such a rapid pace that model owners and evaluators are hard-pressed analyzing and studying them. This is exacerbated by the complicated procedures for evaluation. The lack of standard systems and efficient techniques for specifying and provisioning ML/DL evaluation is the main cause of this "pain point". This work discusses common pitfalls for replicating DL model evaluation, and shows that these subtle pitfalls can affect both accuracy and performance. It then proposes a solution to remedy these pitfalls called MLModelScope, a specification for repeatable model evaluation and a runtime to provision and measure experiments. We show that by easing the model specification and evaluation process, MLModelScope facilitates rapid adoption of ML/DL innovations.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the complexity and reproducibility problems that arise when evaluating machine learning (ML) and deep learning (DL) models. It identifies the following problems with current ML/DL model sharing and evaluation practice:

1. **Complexity of the hardware and software stack**:
   - ML/DL model evaluation depends on multiple hardware and software abstraction layers (application pipelines, model definitions, framework execution, library calls, and hardware instruction execution). These layers must work together to reproduce the reported accuracy and performance.
   - Setting up and configuring these hardware and software stacks is complex and usually requires detailed documentation, but the existing documentation is often insufficient.
2. **Common pitfalls when replicating model claims**:
   - **Pre-processing/Post-processing**: If the pre-processing of input data and the post-processing of output data are not handled exactly as the model owner handled them, subtle errors can creep in and produce inconsistent results.
   - **Software stack**: Different versions of frameworks and libraries (such as TensorFlow, PyTorch, and MKL-DNN) affect the accuracy and performance of the evaluation.
   - **Hardware configuration**: Different hardware configurations (such as CPU extensions, multithreading, and vectorization) also affect performance and accuracy.
   - **Programming language selection**: Different programming languages (such as Python and C/C++) and their numerical representations can have a significant impact on performance.
3. **Lack of standardized evaluation specifications**:
   - Although existing model sharing methods facilitate academic exchange, it is still very difficult for ordinary users to understand and reproduce these models. Even experts need considerable effort to reproduce the model results of others.
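The pre-processing pitfall can be illustrated with a minimal sketch. The pixel values, mean, and scale below are illustrative assumptions, not values from the paper, and the two helper functions are hypothetical. They apply the same two operations (mean subtraction and scaling) in different orders, which is the kind of detail a model description can leave unstated.

```python
# Hypothetical example: the same two pre-processing steps applied in
# different orders produce very different model inputs.

def subtract_then_scale(pixels, mean, scale):
    """Subtract the mean first, then divide by the scale."""
    return [(p - mean) / scale for p in pixels]


def scale_then_subtract(pixels, mean, scale):
    """Divide by the scale first, then subtract the mean."""
    return [p / scale - mean for p in pixels]


# Illustrative values only (assumptions, not taken from the paper).
pixels = [0, 64, 128, 255]
mean, scale = 128.0, 255.0

a = subtract_then_scale(pixels, mean, scale)
b = scale_then_subtract(pixels, mean, scale)

# Largest element-wise gap between the two pre-processed inputs.
max_diff = max(abs(x - y) for x, y in zip(a, b))
print("subtract then scale:", a)
print("scale then subtract:", b)
print("largest difference:", max_diff)
```

Both pipelines run without error, so a mismatch like this would only show up as degraded accuracy, not as a failure.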
To solve these problems, the paper proposes a solution named **MLModelScope**. MLModelScope simplifies and standardizes the model evaluation process in the following ways:

- **Model Manifest**: A text-format specification that standardizes model sharing and avoids the identified pitfalls. Model owners can share their models without having to write complete documentation.
- **Runtime System**: Takes the model manifest as input to simplify the model evaluation process and make it accessible to both ordinary users and experts.
- **Extensible data collection and analysis pipeline**: Helps simplify model understanding, analysis, and comparison.

Through these improvements, MLModelScope aims to accelerate the adoption of ML/DL innovations, enabling more people to evaluate and use these models.
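As a rough sketch of what a manifest-driven evaluation could check, the snippet below models a manifest as a plain Python dictionary and reports which evaluation-relevant fields are missing. The field names, the `REQUIRED_FIELDS` tuple, and the `missing_fields` helper are all hypothetical and chosen to mirror the pitfalls listed above. They are not the actual MLModelScope manifest schema or API.

```python
# Hypothetical manifest fields (assumptions, not the MLModelScope schema).
# Each one corresponds to a pitfall named in the summary above.
REQUIRED_FIELDS = (
    "name",
    "framework",
    "framework_version",
    "preprocessing",
    "postprocessing",
)


def missing_fields(manifest):
    """Return the required fields that the manifest does not specify."""
    return [field for field in REQUIRED_FIELDS if field not in manifest]


# Illustrative manifest; every value here is made up for the example.
manifest = {
    "name": "example-classifier",
    "framework": "TensorFlow",
    "framework_version": "1.12.0",
    "preprocessing": ["decode", "resize", "subtract_mean", "scale"],
    "postprocessing": ["softmax", "top_k"],
}

incomplete = {"name": "example-classifier", "framework": "TensorFlow"}

print(missing_fields(manifest))
print(missing_fields(incomplete))
```

Recording the pre-processing steps as an ordered list is a deliberate choice in this sketch: it makes the step order part of the shared specification instead of something a reader has to infer.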