FluidML: Fast and Memory Efficient Inference Optimization

Jinjie Liu, Hang Qiu
2024-11-14
Abstract: Machine learning models deployed on edge devices have enabled numerous exciting new applications, such as humanoid robots, AR glasses, and autonomous vehicles. However, the computing resources available on these edge devices are not keeping up with the ever-growing number of parameters in these models. As models become bigger and more complicated, their novel and sophisticated structures make inference runtime optimization harder. We present FluidML, a generic runtime memory management and optimization framework that can flexibly transform the model execution blueprint to achieve faster and more memory-efficient inference. Evaluations across different platforms show that FluidML can consistently reduce the end-to-end inference latency by up to 25.38% for popular language models and reduce peak memory usage by up to 41.47%, compared to state-of-the-art approaches. FluidML comprises ~30K lines of code, is built for general-purpose usage, and will be released to the community as an open-source inference runtime optimization framework.
Machine Learning
What problem does this paper attempt to address?
The core problem this paper addresses is that machine learning models deployed on edge devices keep growing in parameter count, which increases the memory and compute time required for inference. Existing optimization methods struggle to reconcile this growing complexity with the limited resources available. The paper therefore proposes FluidML, a general-purpose runtime memory management and optimization framework that aims to achieve faster and more memory-efficient inference by flexibly transforming the model execution blueprint.

### Specific Problem Description

1. **Contradiction between Resource Limitations and Model Complexity**:
   - Computing resources on edge devices are limited, while the number of parameters in modern machine learning models (such as Transformers and foundation models) keeps increasing, which significantly raises memory usage and compute time during inference.
2. **Limitations of Existing Optimization Methods**:
   - Existing optimization methods (such as pruning, quantization, and knowledge distillation) mainly reduce redundancy in the model itself. In a complex neural network graph, however, optimizing each operator independently may not yield a globally optimal result.
3. **Lack of a General-Purpose Optimization Framework**:
   - There is no general-purpose, model-independent framework that provides an overall plan for optimizing the flow of numerical computation through the whole graph.

### FluidML Solution

FluidML addresses these problems in the following ways:

- **Jointly Optimize Operator Memory Layout**: By optimizing the entire graph jointly, it chooses operator memory layouts that are optimal across the graph rather than per operator.
- **Generate a Memory-Access-Friendly Execution Blueprint**: By optimizing the memory access pattern, it accelerates the execution of both individual operators and the end-to-end graph.
- **Reduce Peak Memory Usage**: By carefully allocating resources, it reduces peak memory usage during inference.
- **Provide Virtual Machines to Collect Data**: By using virtual machines to collect actual performance data, it helps the compiler find the best schedule.

### Experimental Results

The experimental results show that, across multiple platforms (Intel, AMD, AArch64), FluidML reduces the inference latency of popular language models (such as BERT) and of other commonly used operators (such as MatMul) by up to 25.38%, and reduces peak memory usage by up to 41.47%.

In summary, FluidML aims to resolve the tension between resource limitations and model complexity that modern machine learning models face on edge devices by providing a general-purpose, flexible optimization framework.
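
The point that optimizing each operator independently can miss the global optimum can be illustrated with a small sketch. It is not FluidML's actual algorithm, which this summary does not spell out. It assumes a simple chain of three operators rather than a general graph, two hypothetical layouts (`row_major`, `col_major`), and made-up per-operator costs plus a fixed layout-conversion cost (`CONVERT_COST`). Under those assumptions, a dynamic program over the chain finds the jointly cheapest layout plan and can be compared with a greedy per-operator choice.

```python
# Illustrative sketch only: all names and numbers below are assumptions,
# not values or APIs taken from the FluidML paper.
LAYOUTS = ("row_major", "col_major")

# Hypothetical execution cost of each operator under each layout.
OP_COST = [
    {"row_major": 4.0, "col_major": 5.0},  # op0
    {"row_major": 6.0, "col_major": 3.0},  # op1
    {"row_major": 4.0, "col_major": 5.0},  # op2
]

# Assumed cost of converting a tensor between layouts between two operators.
CONVERT_COST = 2.5


def transition(prev, cur):
    """Conversion cost paid when consecutive operators use different layouts."""
    return 0.0 if prev == cur else CONVERT_COST


def total_cost(plan):
    """Total cost of a layout plan: operator costs plus layout conversions."""
    cost = sum(OP_COST[i][layout] for i, layout in enumerate(plan))
    cost += sum(transition(a, b) for a, b in zip(plan, plan[1:]))
    return cost


def greedy_plan():
    """Pick the cheapest layout for each operator in isolation."""
    return tuple(min(LAYOUTS, key=lambda l: costs[l]) for costs in OP_COST)


def joint_plan():
    """Dynamic program over the chain: best plan ending in each layout."""
    best = {l: (OP_COST[0][l], (l,)) for l in LAYOUTS}
    for costs in OP_COST[1:]:
        nxt = {}
        for cur in LAYOUTS:
            # Extend the cheapest predecessor plan, paying any conversion cost.
            nxt[cur] = min(
                (best[prev][0] + transition(prev, cur) + costs[cur],
                 best[prev][1] + (cur,))
                for prev in LAYOUTS
            )
        best = nxt
    return min(best.values())[1]


if __name__ == "__main__":
    for name, plan in (("greedy", greedy_plan()), ("joint", joint_plan())):
        print(name, plan, total_cost(plan))
```

With these illustrative numbers, the greedy plan alternates layouts and pays two conversions for a total cost of 16.0, while the joint plan keeps every operator in `col_major` for a total of 13.0, even though `col_major` is not the cheapest layout for two of the three operators in isolation. A real model graph has branches and many more layout choices, so the chain here is only meant to show why the choice has to be made jointly.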