Abstract: With the rapid development of deep learning, more and more cloud and edge applications use large DNN (Deep Neural Network) models to improve task execution efficiency and decision-making quality. Because of memory constraints, models are commonly optimized with compression, pruning, and partitioning algorithms so that they can be deployed on resource-constrained devices. As conditions on the computational platform change dynamically, the deployed optimization algorithms should adapt their solutions accordingly. To evaluate these solutions frequently and in a timely fashion, RMs (Regression Models) are commonly trained to predict the relevant solution quality metrics, such as the resulting DNN module inference latency, which is the focus of this paper. Existing prediction frameworks specify different RM training workflows, but none of them allows flexible configuration of the input parameters (e.g., batch size, device utilization rate) or of the RMs selected for different modules. In this paper, a deep learning module inference latency prediction framework is proposed, which i) hosts a set of customizable input parameters to train multiple different RMs per DNN module (e.g., convolutional layer) with self-generated datasets, and ii) automatically selects a set of trained RMs leading to the highest possible overall prediction accuracy, while keeping the prediction time and space consumption as low as possible. Furthermore, a new RM, namely MEDN (Multi-task Encoder-Decoder Network), is proposed as an alternative solution. Comprehensive experimental results show that MEDN is fast and lightweight, and capable of achieving the highest overall prediction accuracy and R-squared value. The Time/Space-efficient Auto-selection algorithm also improves the overall accuracy by 2.5% and R-squared by 0.39% compared to the MEDN single-selection scheme.
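The auto-selection idea in the abstract (pick, per DNN module, a trained RM with the highest accuracy while keeping prediction time and space low) can be illustrated with a minimal sketch. This is not the paper's algorithm: the `Candidate` record, the `select_rms` function, the `tolerance` knob, and every number in the example are hypothetical placeholders chosen only to show the shape of such a selection step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One trained regression model (RM) evaluated on one DNN module type."""
    rm_name: str       # hypothetical label, e.g. "mlp"
    accuracy: float    # assumed accuracy score in [0, 1]
    predict_ms: float  # assumed prediction time per query
    size_kb: float     # assumed stored model size


def select_rms(candidates, tolerance=0.01):
    """Pick one RM per module: near-best accuracy first, then lowest cost.

    `candidates` maps a module name to a list of Candidate records.
    `tolerance` is an assumed knob: RMs within this accuracy gap of the
    best one are treated as equally accurate.
    """
    selection = {}
    for module, rms in candidates.items():
        best_acc = max(c.accuracy for c in rms)
        near_best = [c for c in rms if best_acc - c.accuracy <= tolerance]
        # Among near-best RMs, prefer the cheapest in time, then in space.
        selection[module] = min(
            near_best, key=lambda c: (c.predict_ms, c.size_kb)
        )
    return selection


# Made-up placeholder values, not measurements from the paper.
candidates = {
    "conv": [
        Candidate("linear", 0.81, 0.02, 4.0),
        Candidate("random_forest", 0.93, 0.90, 850.0),
        Candidate("mlp", 0.925, 0.10, 60.0),
    ],
    "fc": [
        Candidate("linear", 0.95, 0.02, 4.0),
        Candidate("mlp", 0.90, 0.10, 60.0),
    ],
}

chosen = select_rms(candidates)
for module, c in chosen.items():
    print(module, c.rm_name)
```

With these placeholder inputs, the sketch trades a 0.005 accuracy gap on the convolutional module for a much smaller and faster model, which is the kind of time/space-aware choice the abstract describes. How the actual framework weighs accuracy against cost is not stated in the abstract.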

Dual-module Inference for Efficient Recurrent Neural Networks

DUET: Boosting Deep Neural Network Efficiency on Dual-Module Architecture

Unlocking the Non-deterministic Computing Power with Memory-Elastic Multi-Exit Neural Networks

ModularBoost: an Efficient Network Inference Algorithm Based on Module Decomposition

Towards A Flexible Accuracy-Oriented Deep Learning Module Inference Latency Prediction Framework for Adaptive Optimization Algorithms

Recurrent Residual Module for Fast Inference in Videos

An Energy-Efficient Architecture for Accelerating Inference of Memory-Augmented Neural Networks

Reversible Recurrent Neural Networks

A 3.89-GOPS/mW Scalable Recurrent Neural Network Processor with Improved Efficiency on Memory and Computation

ReuseSense: With Great Reuse Comes Greater Efficiency; Effectively Employing Computation Reuse on General-Purpose CPUs

High-Performance Temporal Reversible Spiking Neural Networks with $O(L)$ Training Memory and $O(1)$ Inference Cost

CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation

A Fast and Power Efficient Architecture to Parallelize LSTM based RNN for Cognitive Intelligence Applications

LightRNN: Memory and Computation-Efficient Recurrent Neural Networks

Energy-Aware Adaptive Multi-Exit Neural Network Inference Implementation for a Millimeter-Scale Sensing System

Saving RNN Computations with a Neuron-Level Fuzzy Memoization Scheme

Exploiting Symmetric Temporally Sparse BPTT for Efficient RNN Training

Sharing Leaky-Integrate-and-Fire Neurons for Memory-Efficient Spiking Neural Networks

Simple Recurrent Units for Highly Parallelizable Recurrence

Memory-Efficient Reversible Spiking Neural Networks

HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference