EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

Han Cai, Junyan Li, Muyan Hu, Chuang Gan, Song Han
2024-02-06
Abstract: High-resolution dense prediction enables many appealing real-world applications, such as computational photography, autonomous driving, etc. However, the vast computational cost makes deploying state-of-the-art high-resolution dense prediction models on hardware devices difficult. This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention. Unlike prior high-resolution dense prediction models that rely on heavy softmax attention, hardware-inefficient large-kernel convolution, or complicated topology structure to obtain good performances, our multi-scale linear attention achieves the global receptive field and multi-scale learning (two desirable features for high-resolution dense prediction) with only lightweight and hardware-efficient operations. As such, EfficientViT delivers remarkable performance gains over previous state-of-the-art models with significant speedup on diverse hardware platforms, including mobile CPU, edge GPU, and cloud GPU. Without performance loss on Cityscapes, our EfficientViT provides up to 13.9$\times$ and 6.2$\times$ GPU latency reduction over SegFormer and SegNeXt, respectively. For super-resolution, EfficientViT delivers up to 6.4$\times$ speedup over Restormer while providing 0.11dB gain in PSNR. For Segment Anything, EfficientViT delivers 48.9$\times$ higher throughput on A100 GPU while achieving slightly better zero-shot instance segmentation performance on COCO.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the high computational cost of high-resolution dense prediction, which makes state-of-the-art models difficult to deploy on hardware devices. It proposes EfficientViT, a new family of vision models built on a novel multi-scale linear attention mechanism. Compared with previous high-resolution dense prediction models, EfficientViT substantially reduces computational cost and latency while matching or improving performance. Its core contributions are:

1. **Multi-scale Linear Attention Module**: Replacing traditional softmax attention with ReLU linear attention gives the model a global receptive field and multi-scale learning while using only lightweight, hardware-efficient operations.
2. **Efficient Model Design**: EfficientViT delivers significant speedups on mobile CPUs, edge GPUs, and cloud GPUs, and performs strongly on datasets such as Cityscapes and ADE20K. On Cityscapes it reduces GPU latency by up to 13.9$\times$ over SegFormer and 6.2$\times$ over SegNeXt without performance loss.
3. **Wide Application Range**: Beyond semantic segmentation and super-resolution, EfficientViT has been applied to Segment Anything, where it delivers 48.9$\times$ higher throughput on an A100 GPU.

In summary, the paper's goal is to cut the computational cost and latency of high-resolution dense prediction models without sacrificing accuracy, making them practical to deploy on real hardware.
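To make the idea of swapping softmax attention for ReLU linear attention concrete, here is a minimal single-head sketch in plain Python. It assumes the common kernel-style formulation of linear attention with ReLU as the feature map, and it omits the multi-scale aggregation entirely. The function names, the `eps` stabilizer, and the list-of-lists layout are illustrative choices, not the authors' implementation. The point it shows is that, without softmax, the key-value products can be summed once and reused for every query, so the cost grows linearly with the number of tokens instead of quadratically.

```python
def relu(rows):
    """Apply ReLU elementwise to a list of row vectors."""
    return [[max(x, 0.0) for x in row] for row in rows]


def relu_linear_attention(q, k, v, eps=1e-6):
    """Linear-cost form. q, k: n rows of length d; v: n rows of length d_v.

    eps is an assumed stabilizer against a zero denominator.
    """
    q, k = relu(q), relu(k)
    d, d_v = len(k[0]), len(v[0])
    # Token-independent summaries, computed once: O(n * d * d_v).
    kv = [[sum(kj[a] * vj[b] for kj, vj in zip(k, v)) for b in range(d_v)]
          for a in range(d)]
    k_sum = [sum(kj[a] for kj in k) for a in range(d)]
    out = []
    for qi in q:
        den = sum(qi[a] * k_sum[a] for a in range(d)) + eps
        out.append([sum(qi[a] * kv[a][b] for a in range(d)) / den
                    for b in range(d_v)])
    return out


def relu_quadratic_attention(q, k, v, eps=1e-6):
    """Reference form that builds all n x n weights: O(n^2 * d)."""
    q, k = relu(q), relu(k)
    d_v = len(v[0])
    out = []
    for qi in q:
        # One non-negative weight per key for this query.
        w = [sum(a * b for a, b in zip(qi, kj)) for kj in k]
        den = sum(w) + eps
        out.append([sum(wj * vj[b] for wj, vj in zip(w, v)) / den
                    for b in range(d_v)])
    return out
```

Both functions compute the same output up to floating-point rounding; only the order of the multiplications differs. The linear form never materializes the n x n attention matrix, which is what makes it attractive at high input resolutions, where the token count n is large.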