Abstract:Vision Transformers (ViTs) have recently made a splash in computer vision domain and achieved state-of-the-art in many vision tasks. Nevertheless, due to their vast model size and high computational costs, rare transformer-based models are adopted in real-world applications. Since the computational costs of attention operation is the square of the input size, some compression methods for the Multi-Head Self-Attention (MHSA) module have been proposed, reducing its FLOPs successfully but almost without parameters reduction. Meanwhile, the number of parameters and computational costs in the Feed-Forward Network (FFN) module exceeds the MHSA larger, while its compression technologies have not been delved deeper. Consequently, we focus our insight on the compression of FFN layer and present a pruning method named Multi-Dimension Compression of Feed-Forward Network in Vision Transformers(MCF), which greatly reduces the model’s parameters and computational costs. Firstly, we identify the critical elements in the output of the FFN module and then employ them to guide the irregular sparsity of this layer, recognizing insignificant elements of FFN layer that have less impact on the output. Successively, to discard the insignificant elements, we transform the irregular sparsity into regular sparsity and prune them, thus reducing the models’ parameters and getting a substantial speed-up during inference. Extensive results on ImageNet-1K validate the effectiveness of our proposed method, which obtains significant parameters and computational costs reduction with almost unimpaired generalization. For example, we compress DeiT-Tiny with 42% reduction in FLOPs and 33% reduction in parameters, almost without losing accuracy on the ImageNet dataset. Further, we verify the effectiveness of our method in the downstream task, using the pruned DeiT-Small as the backbone for the object detection task on the COCO dataset, gaining revenue without compromising its performance.

Image super-resolution: prefix-tuning transformer from large to small datasets

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers

Self-Reference Image Super-Resolution via Pre-trained Diffusion Large Model and Window Adjustable Transformer

CSwT-SR: Conv-Swin Transformer for Blind Remote Sensing Image Super-Resolution with Amplitude-Phase Learning and Structural Detail Alternating Learning

Exploring Frequency-Inspired Optimization in Transformer for Efficient Single Image Super-Resolution

Image Super-resolution via Efficient Transformer Embedding Frequency Decomposition with Restart

Activated Sparsely Sub-Pixel Transformer for Remote Sensing Image Super-Resolution

Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach

Cross-Spatial Pixel Integration and Cross-Stage Feature Fusion Based Transformer Network for Remote Sensing Image Super-Resolution

Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary

TTST: A Top-k Token Selective Transformer for Remote Sensing Image Super-Resolution

HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution

Efficient Transformer for Single Image Super-Resolution.

Transformer-based image super-resolution and its lightweight

AdaFormer: Efficient Transformer with Adaptive Token Sparsification for Image Super-resolution

Super-Resolution Algorithm Based on Transformer+CNN

An efficient parallel fusion structure of distilled and transformer-enhanced modules for lightweight image super-resolution

Visual Fourier Prompt Tuning

Task-Aware Dynamic Transformer for Efficient Arbitrary-Scale Image Super-Resolution

An Efficient Hybrid CNN-Transformer Approach for Remote Sensing Super-Resolution