Abstract: Machine learning models deployed on edge devices have enabled numerous exciting new applications, such as humanoid robots, AR glasses, and autonomous vehicles. However, the computing resources available on these edge devices are not keeping pace with the ever-growing number of parameters in these models. As models become bigger and more complicated, their novel yet sophisticated structures make inference runtime optimization challenging. We present FluidML, a generic runtime memory management and optimization framework that can flexibly transform the model execution blueprint to achieve faster and more memory-efficient inference. Evaluations across different platforms show that FluidML can consistently reduce the end-to-end inference latency by up to 25.38% for popular language models and reduce peak memory usage by up to 41.47%, compared to state-of-the-art approaches. FluidML comprises ~30K lines of code, is built for general-purpose usage, and will be released to the community as an open-source inference runtime optimization framework.

What problem does this paper attempt to address?

The core problem this paper addresses is that machine learning models deployed on edge devices have an ever-increasing number of parameters, which drives up the demand for resources (memory and compute time) during inference. Existing optimization methods struggle to reconcile this growing complexity with the resource limits of the devices. The paper therefore proposes FluidML, a general-purpose runtime memory management and optimization framework that aims to achieve faster and more memory-efficient inference by flexibly transforming the model execution blueprint.

### Specific Problem Description

1. **Tension between Resource Limitations and Model Complexity**:
   - The computing resources on edge devices are limited, while the number of parameters in modern machine learning models (such as Transformers and foundation models) keeps increasing, which significantly raises memory occupation and compute time during inference.
2. **Limitations of Existing Optimization Methods**:
   - Existing optimization methods (such as pruning, quantization, and knowledge distillation) mainly reduce redundancy in the model itself. In a complex neural network graph, however, optimizing each operator independently may not reach the global optimum.
3. **Lack of a General-Purpose Optimization Framework**:
   - There is no general-purpose, model-independent framework that provides an overall plan for optimizing the flow of numerical computation throughout the graph.

### FluidML Solution

FluidML addresses the above problems in the following ways:

- **Jointly Optimize Operator Memory Layout**: By optimizing the entire graph jointly, it ensures that the memory layouts chosen for the operators are optimal across the graph as a whole.
- **Generate Memory-Access-Friendly Execution Blueprint**: By optimizing memory access patterns, it accelerates the execution of individual operators and of the end-to-end graph.
- **Reduce Peak Memory Usage**: By carefully allocating resources, it reduces the peak memory usage during inference.
- **Provide Virtual Machines to Collect Data**: By using virtual machines to collect actual performance data, it helps the compiler find the best scheduling scheme.

### Experimental Results

The experimental results show that, on multiple platforms (Intel, AMD, Aarch64), FluidML reduces the inference latency of popular language models (such as BERT) and other commonly used operators (such as MatMul) by up to 25.38% and the peak memory usage by up to 41.47%.

In summary, the goal of FluidML is to address the resource limitations and complexity problems that modern machine learning models face when deployed on edge devices by providing a general-purpose, flexible optimization framework.
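The claim that optimizing each operator independently may miss the global optimum can be illustrated with a toy example. The sketch below is not FluidML's algorithm; it assumes a simple chain of three operators, two hypothetical layouts (`"row"` and `"col"`), made-up per-operator costs, and a made-up fixed cost for converting between layouts. It compares a per-operator greedy choice with a dynamic-programming search over the whole chain.

```python
# Toy illustration only: all costs, layout names, and helpers are hypothetical.
LAYOUTS = ("row", "col")

# Hypothetical cost (arbitrary units) of running each operator in each layout.
OP_COST = [
    {"row": 1.0, "col": 4.0},  # op0 prefers row-major
    {"row": 5.0, "col": 1.0},  # op1 prefers column-major
    {"row": 1.0, "col": 4.0},  # op2 prefers row-major
]
CONVERT_COST = 2.5  # hypothetical cost paid when adjacent ops use different layouts


def total_cost(plan, op_cost, convert_cost):
    """Sum the operator costs plus one conversion per layout change."""
    cost = sum(costs[layout] for costs, layout in zip(op_cost, plan))
    changes = sum(1 for a, b in zip(plan, plan[1:]) if a != b)
    return cost + changes * convert_cost


def greedy_plan(op_cost):
    """Pick each operator's cheapest layout, ignoring its neighbours."""
    return [min(LAYOUTS, key=lambda l: costs[l]) for costs in op_cost]


def joint_plan(op_cost, convert_cost):
    """Dynamic programming over the chain: cheapest plan ending in each layout."""
    best = {l: (op_cost[0][l], [l]) for l in LAYOUTS}
    for costs in op_cost[1:]:
        nxt = {}
        for layout in LAYOUTS:
            candidates = []
            for prev in LAYOUTS:
                prev_cost, prev_plan = best[prev]
                step = convert_cost if prev != layout else 0.0
                candidates.append((prev_cost + step + costs[layout],
                                   prev_plan + [layout]))
            nxt[layout] = min(candidates, key=lambda t: t[0])
        best = nxt
    return min(best.values(), key=lambda t: t[0])


if __name__ == "__main__":
    g = greedy_plan(OP_COST)
    print("greedy:", g, total_cost(g, OP_COST, CONVERT_COST))
    cost, plan = joint_plan(OP_COST, CONVERT_COST)
    print("joint: ", plan, cost)
```

With these made-up numbers the greedy plan alternates layouts and pays two conversions (total 8.0), while the joint search keeps a single layout throughout (total 7.0), even though that forces the middle operator into its slower layout. Real model graphs are not simple chains, so a practical framework would need a search that handles branching and measured rather than assumed costs.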

FluidML: Fast and Memory Efficient Inference Optimization

Towards Memory-Efficient Inference in Edge Video Analytics

FlexNN: Efficient and Adaptive DNN Inference on Memory-Constrained Edge Devices.

GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching

Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU

Talaria: Interactively Optimizing Machine Learning Models for Efficient Inference

A Fine-Grained End-to-End Latency Optimization Framework for Wireless Collaborative Inference

EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models

Inference Performance Optimization for Large Language Models on CPUs

Memory-efficient Energy-adaptive Inference of Pre-Trained Models on Batteryless Embedded Systems

Efficient Memory Management for Deep Neural Net Inference

Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures

Enabling One-size-fits-all Compilation Optimization across Machine Learning Computers for Inference

MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices

Enabling One-Size-Fits-All Compilation Optimization for Inference Across Machine Learning Computers

MemFlow: Optical Flow Estimation and Prediction with Memory

FLUID: A Unified Evaluation Framework for Flexible Sequential Data

FlashDecoding++: Faster Large Language Model Inference on GPUs

AStitch: Enabling a New Multi-Dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures

An Energy-Efficient Architecture for Accelerating Inference of Memory-Augmented Neural Networks

AdaLomo: Low-memory Optimization with Adaptive Learning Rate