Abstract: Large-scale Transformer models bring significant improvements to various downstream vision language tasks with a unified architecture. These performance improvements come with increasing model size, resulting in slow inference and higher serving cost. While some predictions benefit from the full capacity of the large-scale model, not all inputs need the same amount of computation, so running every input through the full model can waste computational resources. To address this, early exiting has been proposed to allocate computation adaptively according to input complexity and thereby improve inference efficiency. Existing early exiting strategies usually use the output confidence of intermediate layers as a proxy for input complexity when deciding whether to skip the following layers. However, such strategies cannot be applied to the encoder of the widely used unified architecture with both an encoder and a decoder, because output confidence is difficult to estimate in the encoder. Ignoring early exiting in the encoder is suboptimal in terms of saving computation. To address this limitation, we propose a novel early exiting strategy for unified vision language models, named \textbf{MuE}, which dynamically skips layers in the encoder and decoder simultaneously based on layer-wise input similarities, allowing multiple early exits. By decomposing the image and text modalities in the encoder, MuE is flexible and can skip different layers for different modalities, improving inference efficiency while minimizing the performance drop. Experiments on the SNLI-VE and MS COCO datasets show that MuE can reduce expected inference time by up to 50\% and 40\% while maintaining 99\% and 96\% of performance, respectively.
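
The general idea of exiting on layer-wise similarity, rather than on output confidence, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the similarity measure, the threshold, or how the modalities are handled, so the use of cosine similarity, the names `cosine_similarity`, `forward_with_early_exit`, and `toward_target`, and the toy layer are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def forward_with_early_exit(
    layers: Sequence[Layer], hidden: Vector, threshold: float
) -> Tuple[Vector, int]:
    """Apply layers in order; stop once a layer barely changes its input.

    Assumption: "barely changes" means the cosine similarity between a
    layer's input and output reaches `threshold`. No classifier or output
    confidence is needed, so the same test could run inside an encoder.
    """
    layers_used = 0
    for layer in layers:
        new_hidden = layer(hidden)
        layers_used += 1
        similarity = cosine_similarity(hidden, new_hidden)
        hidden = new_hidden
        if similarity >= threshold:
            break  # remaining layers are skipped
    return hidden, layers_used


def toward_target(hidden: Vector) -> Vector:
    """Hypothetical toy layer: move halfway toward a fixed target vector."""
    target = [1.0, 1.0]
    return [h + 0.5 * (t - h) for h, t in zip(hidden, target)]


if __name__ == "__main__":
    stack = [toward_target] * 6
    output, used = forward_with_early_exit(stack, [1.0, 0.0], threshold=0.99)
    print(f"exited after {used} of {len(stack)} layers, output={output}")
```

In this toy setting each layer changes the representation less than the one before, so the similarity rises and the loop stops before the full stack has run. A per-modality variant could, under the same assumptions, run this test separately on image and text representations so that each exits at its own depth.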

Early Exit with Disentangled Representation and Equiangular Tight Frame.

Early Exiting with Ensemble Internal Classifiers.

A Global Past-Future Early Exit Method for Accelerating Inference of Pre-trained Language Models.

Joint or Disjoint: Mixing Training Regimes for Early-Exit Models

DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models

An Efficient Inference Framework for Early-exit Large Language Models

You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model

LECO: Improving Early Exiting Via Learned Exits and Comparison-based Exiting Mechanism.

SmartBERT: A Promotion of Dynamic Early Exiting Mechanism for Accelerating BERT Inference.

EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism

ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference

EE-Tuning: An Economical yet Scalable Solution for Tuning Early-Exit Large Language Models

BADGE: Speeding Up BERT Inference after Deployment Via Block-wise Bypasses and Divergence-based Early Exiting.

Dynamic Perceiver for Efficient Visual Recognition

Training dynamic models using early exits for automatic speech recognition on resource-constrained devices

ELF: An Early-Exiting Framework for Long-Tailed Classification

BEExformer: A Fast Inferencing Transformer Architecture via Binarization with Multiple Early Exits

LGViT: Dynamic Early Exiting for Accelerating Vision Transformer

Learning to Weight Samples for Dynamic Early-Exiting Networks.

Dynamic Transformers Provide a False Sense of Efficiency

RAEE: A Robust Retrieval-Augmented Early Exiting Framework for Efficient Inference