Abstract: Recent studies have integrated convolution into transformers to introduce inductive bias and improve generalization performance. However, conventional convolution is static and cannot adapt to input variations, whereas self-attention computes its attention matrices dynamically; this creates a representation discrepancy between the two operations. Furthermore, when token mixers that combine convolution and self-attention are stacked to form a deep network, the static nature of convolution hinders the fusion of features previously generated by self-attention into the convolution kernels. These two limitations result in a sub-optimal representation capacity of the constructed networks. To address them, we propose a lightweight Dual Dynamic Token Mixer (D-Mixer) that aggregates global information and local details in an input-dependent way. D-Mixer applies an efficient global attention module and an input-dependent depthwise convolution separately to evenly split feature segments, endowing the network with strong inductive bias and an enlarged effective receptive field. We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network that delivers compelling performance. In the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost. Furthermore, TransXNet-S and TransXNet-B exhibit excellent model scalability, achieving top-1 accuracy of 83.8% and 84.6% respectively, with reasonable computational costs. Additionally, our proposed network architecture demonstrates strong generalization capabilities in various dense prediction tasks, outperforming other state-of-the-art networks while having lower computational costs. Code is available at https://github.com/LMMMEng/TransXNet.
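The split-and-mix idea in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: it works on a 1D token sequence rather than a 2D feature map, the attention is a parameter-free stand-in for the efficient global attention module, and the gating rule inside `input_dependent_dwconv` (mixing a hypothetical `kernel_bank` by scores derived from the pooled input) is an assumption chosen only to make the kernel depend on the input. All function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(tokens):
    """Parameter-free self-attention: every token attends to all tokens."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(dim)])
    return out


def input_dependent_dwconv(tokens, kernel_bank):
    """Depthwise 1D convolution whose kernel is built from the input.

    For each channel, candidate kernels in `kernel_bank` are mixed with
    softmax weights computed from the channel's global average.
    The scoring rule below is an illustrative assumption, not a learned layer.
    """
    n, dim = len(tokens), len(tokens[0])
    ksize = len(kernel_bank[0])
    pad = ksize // 2
    out = [[0.0] * dim for _ in range(n)]
    for c in range(dim):
        pooled = sum(t[c] for t in tokens) / n  # global average pooling
        mix = softmax([pooled * (g + 1) for g in range(len(kernel_bank))])
        kernel = [sum(m * kb[j] for m, kb in zip(mix, kernel_bank)) for j in range(ksize)]
        for i in range(n):
            acc = 0.0
            for j in range(ksize):
                idx = i + j - pad
                if 0 <= idx < n:  # zero padding at the sequence borders
                    acc += kernel[j] * tokens[idx][c]
            out[i][c] = acc
    return out


def d_mixer_sketch(tokens, kernel_bank):
    """Split channels evenly, mix each half separately, then concatenate."""
    half = len(tokens[0]) // 2
    global_part = global_attention([t[:half] for t in tokens])
    local_part = input_dependent_dwconv([t[half:] for t in tokens], kernel_bank)
    return [g + l for g, l in zip(global_part, local_part)]


tokens = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, 1.0, 2.0], [0.5, 0.5, -1.0, 0.0]]
kernel_bank = [[0.0, 1.0, 0.0], [1 / 3, 1 / 3, 1 / 3]]
mixed = d_mixer_sketch(tokens, kernel_bank)
```

Because the convolution kernel is recomputed from the pooled input, the local branch is no longer a fixed linear filter: scaling the input changes the kernel as well as the signal, which is the "input-dependent" property the abstract contrasts with static convolution.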

NiNformer: A Network in Network Transformer with Token Mixing as a Gating Function Generator

TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition

Adaptive Frequency Filters As Efficient Global Token Mixers

MetaMixer Is All You Need

MLP Can Be A Good Transformer Learner

AMixer: Adaptive Weight Mixing for Self-attention Free Vision Transformers

SCHEME: Scalable Channel Mixer for Vision Transformers

Hierarchical Associative Memory, Parallelized MLP-Mixer, and Symmetry Breaking

Demystify Transformers & Convolutions in Modern Image Deep Networks

Activator: GLU Activation Function as the Core Component of a Vision Transformer

FAM: Improving columnar vision transformer with feature attention mechanism

MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens

FMViT: A multiple-frequency mixing Vision Transformer

DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion

Rethinking Token-Mixing MLP for MLP-based Vision Backbone

Token Fusion: Bridging the Gap between Token Pruning and Token Merging

DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

DMFormer: Closing the Gap Between CNN and Vision Transformers

MVFormer: Diversifying Feature Normalization and Token Mixing for Efficient Vision Transformers