EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

Han Cai, Junyan Li, Muyan Hu, Chuang Gan, Song Han

2024-02-06

Abstract: High-resolution dense prediction enables many appealing real-world applications, such as computational photography and autonomous driving. However, the vast computational cost makes deploying state-of-the-art high-resolution dense prediction models on hardware devices difficult. This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention. Unlike prior high-resolution dense prediction models that rely on heavy softmax attention, hardware-inefficient large-kernel convolution, or complicated topology structure to obtain good performance, our multi-scale linear attention achieves the global receptive field and multi-scale learning (two desirable features for high-resolution dense prediction) with only lightweight and hardware-efficient operations. As such, EfficientViT delivers remarkable performance gains over previous state-of-the-art models with significant speedup on diverse hardware platforms, including mobile CPU, edge GPU, and cloud GPU. Without performance loss on Cityscapes, our EfficientViT provides up to 13.9$\times$ and 6.2$\times$ GPU latency reduction over SegFormer and SegNeXt, respectively. For super-resolution, EfficientViT delivers up to 6.4$\times$ speedup over Restormer while providing 0.11dB gain in PSNR. For Segment Anything, EfficientViT delivers 48.9$\times$ higher throughput on A100 GPU while achieving slightly better zero-shot instance segmentation performance on COCO.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the high computational cost of high-resolution dense prediction, which makes state-of-the-art models difficult to deploy on hardware devices. It proposes EfficientViT, a new family of vision models built on a novel multi-scale linear attention mechanism. Compared with previous high-resolution dense prediction models, EfficientViT substantially reduces computational cost and latency while maintaining or improving performance. The core contributions of EfficientViT are:

1. **Multi-scale Linear Attention Module**: ReLU linear attention replaces traditional softmax attention, providing a global receptive field and multi-scale learning while avoiding hardware-inefficient operations.
2. **Efficient Model Design**: EfficientViT runs significantly faster on a range of hardware platforms (mobile CPUs, edge GPUs, and cloud GPUs) and performs well on datasets such as Cityscapes and ADE20K. On Cityscapes it provides up to 13.9$\times$ and 6.2$\times$ GPU latency reduction over SegFormer and SegNeXt, respectively, without performance loss.
3. **Wide Application Range**: Beyond semantic segmentation and super-resolution, EfficientViT has also been applied to the Segment Anything task, where it delivers 48.9$\times$ higher throughput on an A100 GPU with slightly better zero-shot instance segmentation performance on COCO.

In summary, the paper's goal is to cut the computational cost and latency of high-resolution dense prediction models without sacrificing accuracy, making them more practical to deploy.
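To make the "ReLU linear attention instead of softmax attention" idea concrete, here is a minimal pure-Python sketch. It assumes the usual linear-attention formulation in which the similarity between a query and a key is `ReLU(q) · ReLU(k)`; the text above does not spell out the formula, so treat that choice, and the function names `linear_attention` and `quadratic_attention`, as illustrative assumptions. The sketch covers only the linear-attention step, not the multi-scale aggregation. Because the similarity has no softmax, the key/value summaries can be computed once and reused for every query, so the cost grows linearly with the number of tokens N rather than quadratically.

```python
def relu(x):
    """Element-wise ReLU on a list of rows."""
    return [[max(val, 0.0) for val in row] for row in x]


def linear_attention(q, k, v, eps=1e-9):
    """ReLU linear attention in O(N * d * d_v) time.

    q, k: N rows of length d; v: N rows of length d_v.
    Assumed similarity: sim(q_i, k_j) = ReLU(q_i) . ReLU(k_j).
    """
    q, k = relu(q), relu(k)
    d, d_v = len(k[0]), len(v[0])
    # Summaries shared by all queries:
    # kv = sum_j k_j^T v_j (d x d_v) and k_sum = sum_j k_j (length d).
    kv = [[0.0] * d_v for _ in range(d)]
    k_sum = [0.0] * d
    for k_j, v_j in zip(k, v):
        for a in range(d):
            k_sum[a] += k_j[a]
            for b in range(d_v):
                kv[a][b] += k_j[a] * v_j[b]
    out = []
    for q_i in q:
        denom = sum(q_i[a] * k_sum[a] for a in range(d)) + eps
        out.append(
            [sum(q_i[a] * kv[a][b] for a in range(d)) / denom for b in range(d_v)]
        )
    return out


def quadratic_attention(q, k, v, eps=1e-9):
    """Same attention computed through the explicit N x N similarity matrix."""
    q, k = relu(q), relu(k)
    d_v = len(v[0])
    out = []
    for q_i in q:
        sims = [sum(a * b for a, b in zip(q_i, k_j)) for k_j in k]
        denom = sum(sims) + eps
        out.append(
            [sum(s * v_j[b] for s, v_j in zip(sims, v)) / denom for b in range(d_v)]
        )
    return out


# Toy inputs: 3 tokens, d = 2, d_v = 2.
q = [[1.0, -2.0], [0.5, 3.0], [-1.0, -1.0]]
k = [[2.0, 1.0], [-1.0, 4.0], [0.0, 0.5]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
fast = linear_attention(q, k, v)
slow = quadratic_attention(q, k, v)
```

Both functions return the same values up to floating-point rounding; the difference is that `linear_attention` never materialises the N x N similarity matrix, which is what matters at high resolution where N is the number of pixels or patches.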

AiluRus: A Scalable ViT Framework for Dense Prediction

EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

FasterViT: Fast Vision Transformers with Hierarchical Attention

Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation

Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs

SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design

Vision Transformers: From Semantic Segmentation to Dense Prediction

Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning

HydraViT: Stacking Heads for a Scalable ViT

Vision Transformers for Dense Prediction

FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization

CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs

DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices

SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer

ScopeViT: Scale-aware Vision Transformer

FMViT: A multiple-frequency mixing Vision Transformer

ViR: Towards Efficient Vision Retention Backbones

MobileViG: Graph-Based Sparse Attention for Mobile Vision Applications