Abstract: Vision Transformers (ViTs) have revolutionized the field of computer vision, yet their deployment on resource-constrained devices remains challenging due to high computational demands. To expedite pre-trained ViTs, token pruning and token merging approaches have been developed, which aim to reduce the number of tokens involved in the computation. However, these methods still have limitations, such as image information loss from pruned tokens and inefficiency in the token-matching process. In this paper, we introduce a novel Graph-based Token Propagation (GTP) method to resolve the challenge of balancing model efficiency and information preservation for efficient ViTs. Inspired by graph summarization algorithms, GTP propagates the information of less significant tokens to spatially and semantically connected tokens that are of greater importance. Consequently, the few remaining tokens serve as a summarization of the entire token graph, allowing the method to reduce computational complexity while preserving essential information from the eliminated tokens. Combined with an innovative token selection strategy, GTP can efficiently identify the image tokens to be propagated. Extensive experiments validate GTP's effectiveness, demonstrating improvements in both efficiency and performance. Specifically, GTP decreases the computational complexity of both DeiT-S and DeiT-B by up to 26% with only a minimal 0.3% accuracy drop on ImageNet-1K without finetuning, and surpasses the state-of-the-art token merging method on various backbones at an even faster inference speed. The source code is available at https://github.com/Ackesnal/GTP-ViT.
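The propagation idea described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function name `propagate_tokens`, the importance `scores`, the `adjacency` weights, the mixing weight `alpha`, and the per-token normalization are all assumptions made for illustration. The sketch keeps the highest-scoring tokens and adds to each of them a weighted share of the features of the dropped tokens it is connected to.

```python
def propagate_tokens(tokens, scores, adjacency, keep, alpha=0.5):
    """Keep the `keep` highest-scoring tokens and fold the rest into them.

    tokens:    list of N feature vectors (lists of floats)
    scores:    list of N importance scores (assumed given; higher = more important)
    adjacency: N x N list of non-negative edge weights (assumed token graph)
    alpha:     assumed weight applied to the propagated information
    Returns the indices of the kept tokens and their updated features.
    """
    n = len(tokens)
    # Rank tokens by importance, highest first.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:keep])
    dropped = sorted(order[keep:])

    updated = []
    for k in kept:
        # Edge weights between kept token k and every dropped token.
        weights = [adjacency[k][d] for d in dropped]
        total = sum(weights)
        merged = list(tokens[k])
        if total > 0:
            # Add a normalized, alpha-scaled share of each connected dropped token.
            for d, w in zip(dropped, weights):
                for j, value in enumerate(tokens[d]):
                    merged[j] += alpha * (w / total) * value
        updated.append(merged)
    return kept, updated


# Toy example: four tokens on a chain graph 0 - 1 - 2 - 3, keeping two of them.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
scores = [0.9, 0.1, 0.8, 0.2]
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
kept, updated = propagate_tokens(tokens, scores, adjacency, keep=2)
print(kept)     # [0, 2]
print(updated)  # [[1.0, 0.5], [3.0, 2.25]]
```

In this toy run, tokens 1 and 3 are removed, but their features still influence the two kept tokens through the graph edges, which is the sense in which the remaining tokens summarize the whole token graph.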

Vision GNN: An Image is Worth Graph of Nodes

PVG: Progressive Vision Graph for Vision Recognition

GreedyViG: Dynamic Axial Graph Construction for Efficient Vision GNNs

GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets

WiGNet: Windowed Vision Graph Neural Network

ViGU: Vision GNN U-Net for Fast MRI

Graph Neural Network (GNN) in Image and Video Understanding Using Deep Learning for Computer Vision Applications

Graph in Graph Neural Network

A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers

Utilizing Edge Features in Graph Neural Networks Via Variational Information Maximization

ViG-UNet: Vision Graph Neural Networks for Medical Image Segmentation

Adaptive GNN for Image Analysis and Editing.

ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention

Dynamic Graph Message Passing Networks for Visual Recognition

GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation

Image and video analysis using graph neural network for Internet of Medical Things and computer vision applications

Tensor-view Topological Graph Neural Network

MobileViG: Graph-Based Sparse Attention for Mobile Vision Applications