Abstract: Recent studies have integrated convolution into transformers to introduce inductive bias and improve generalization performance. However, conventional convolution is static and cannot adapt to input variations, which creates a representation discrepancy between convolution and self-attention, since self-attention computes its attention matrices dynamically. Furthermore, when token mixers consisting of convolution and self-attention are stacked to form a deep network, the static nature of convolution hinders the fusion of features previously generated by self-attention into convolution kernels. These two limitations result in sub-optimal representation capacity of the constructed networks. To address them, we propose a lightweight Dual Dynamic Token Mixer (D-Mixer) that aggregates global information and local details in an input-dependent way. D-Mixer applies an efficient global attention module and an input-dependent depthwise convolution separately to evenly split feature segments, endowing the network with strong inductive bias and an enlarged effective receptive field. We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network that delivers compelling performance. On the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost. Furthermore, TransXNet-S and TransXNet-B exhibit excellent model scalability, achieving top-1 accuracy of 83.8% and 84.6%, respectively, with reasonable computational costs. Additionally, our proposed network architecture demonstrates strong generalization capabilities in various dense prediction tasks, outperforming other state-of-the-art networks while having lower computational costs. Code is available at https://github.com/LMMMEng/TransXNet.
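The split-and-mix idea described in the abstract can be illustrated with a small pure-Python sketch. This is not the authors' implementation (see the linked repository for that). It works on a 1-D token sequence instead of a 2-D feature map, assumes the even split is along the channel dimension, assumes the two halves are concatenated back together, uses identity query/key/value projections, and generates each depthwise kernel by mixing a small bank of static kernels with a made-up gating rule. All function names and the `kernel_bank` parameter are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(tokens):
    """Softmax self-attention over all tokens (identity Q/K/V: a simplification)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(dim)])
    return out


def dynamic_depthwise_conv(tokens, kernel_bank):
    """Depthwise 1-D convolution whose per-channel kernel depends on the input.

    Assumption: the kernel is a softmax-weighted mix of the static kernels in
    kernel_bank, gated by the channel's global average. The gating rule is
    illustrative only.
    """
    n, dim = len(tokens), len(tokens[0])
    ksize = len(kernel_bank[0])
    pad = ksize // 2
    out = [[0.0] * dim for _ in range(n)]
    for c in range(dim):
        pooled = sum(t[c] for t in tokens) / n  # global average pooling
        mix = softmax([pooled * (g + 1) for g in range(len(kernel_bank))])
        kernel = [sum(m * k[j] for m, k in zip(mix, kernel_bank)) for j in range(ksize)]
        for i in range(n):
            acc = 0.0
            for j in range(ksize):
                src = i + j - pad
                if 0 <= src < n:  # zero padding at the sequence borders
                    acc += kernel[j] * tokens[src][c]
            out[i][c] = acc
    return out


def d_mixer_sketch(tokens, kernel_bank):
    """Split channels evenly, mix each half separately, then concatenate."""
    half = len(tokens[0]) // 2
    attn_out = global_attention([t[:half] for t in tokens])
    conv_out = dynamic_depthwise_conv([t[half:] for t in tokens], kernel_bank)
    return [a + c for a, c in zip(attn_out, conv_out)]


tokens = [[1.0, 0.0, 0.5, -1.0], [0.0, 1.0, 1.5, 2.0], [1.0, 1.0, -0.5, 0.0]]
kernel_bank = [[0.0, 1.0, 0.0], [1 / 3, 1 / 3, 1 / 3]]
mixed = d_mixer_sketch(tokens, kernel_bank)
```

In this sketch, both branches depend on the input: the attention weights are recomputed for every token sequence, and the convolution kernel changes with the pooled statistics of its channel, so doubling the input does not simply double the convolution output, as it would for a static kernel.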

TransConvNet: Perform perceptually relevant driver's visual attention predictions

Driver attention prediction based on convolution and transformers

A lightweight model combining convolutional neural network and Transformer for driver distraction recognition

Multimodal driver distraction detection using dual-channel network of CNN and Transformer

Improving real-time driver distraction detection via constrained attention mechanism

VisionNet: A Drivable-space-based Interactive Motion Prediction Network for Autonomous Driving

FAM: Improving columnar vision transformer with feature attention mechanism

FBLNet: FeedBack Loop Network for Driver Attention Prediction

Research on Visual Perception Technology of Autonomous Driving Based on Improved Convolutional Neural Network

Convolution-enhanced Evolving Attention Networks

Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

Multisource Adaption for Driver Attention Prediction in Arbitrary Driving Scenes

Target-point Attention Transformer: A novel trajectory predict network for end-to-end autonomous driving

Improved Attention Mechanism for Human-like Intelligent Vehicle Trajectory Prediction

TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition

DAT++: Spatially Dynamic Vision Transformer with Deformable Attention

Surrounding-aware representation prediction in Birds-Eye-View using transformers

MmSTCT: spatial–temporal convolution transformer network considering driving intention for multimodal vehicle trajectory prediction of highway

PEDTrans: A Fine-Grained Visual Classification Model for Self-attention Patch Enhancement and Dropout

Recent advancements in driver's attention prediction