Abstract: Vision Transformers have attracted a lot of attention since the successful application of the Vision Transformer (ViT) to vision tasks. With vision Transformers, specifically their multi-head self-attention modules, networks can inherently capture long-range dependencies. However, these attention modules normally need to be trained on large datasets, and vision Transformers trained from scratch on small datasets perform worse than widely used backbones such as ResNets. Note that the Transformer model was first proposed for natural language processing, whose inputs carry denser information than natural images. To boost the performance of vision Transformers on small datasets, this paper proposes to explicitly increase the input information density in the frequency domain. Specifically, we select channels by computing channel-wise heatmaps in the frequency domain using the Discrete Cosine Transform (DCT), reducing the size of the input while keeping most of the information and hence increasing the information density. As a result, 25% fewer channels are kept while better performance is achieved compared with previous work. Extensive experiments demonstrate the effectiveness of the proposed approach on five small-scale datasets: CIFAR-10/100, SVHN, Flowers-102, and Tiny ImageNet. Accuracy is improved by up to 17.05% with Swin and Focal Transformers. Code is available at https://github.com/xiangyu8/DenseVT.

What problem does this paper attempt to address?

The paper addresses the poor performance of vision Transformers trained on small-scale datasets. Although vision Transformers perform very well on large-scale datasets, when trained from scratch on small-scale datasets they usually fall behind conventional convolutional neural networks such as ResNet. The paper attributes this mainly to the Transformer's origin in natural language processing: natural language has a higher information density than natural images, which leaves a gap when the model is applied directly to image tasks.

To close this gap, the paper proposes to explicitly increase the input information density. It selects useful channels in the frequency domain, which reduces the input size while retaining most of the information. Experiments on CIFAR-10/100, SVHN, Flowers-102, and Tiny ImageNet show that the approach is effective, with clear accuracy gains for Swin and Focal Transformers.

### Main contributions of the paper:

1. **Explicitly increasing input information density**: According to the authors, this is the first work to explicitly increase the input information density of vision Transformers in order to narrow the gap between language and image inputs.
2. **Channel selection strategy based on channel heatmaps**: A simple and effective strategy based on channel heatmaps selects the useful DCT frequency channels. Compared with previous work, it retains fewer channels and achieves better results.
3. **Experimental verification**: Extensive experiments on Tiny ImageNet, CIFAR-10/100, Flowers-102, and SVHN verify the effectiveness of learning in the DCT frequency domain.

### Method overview:

1. **Block DCT**: Convert RGB images to the YCbCr representation, apply the DCT to each small block, regroup the DCT coefficients by frequency, and discard uninformative frequency channels.
2. **Channel heatmap**: Use GradCAM to compute a heatmap for each frequency channel and select the channels that contribute most to the final high-level features.
3. **Frequency channel selection and information density**: Selecting low-frequency channels shrinks the frequency representation while retaining most of the information, which increases the information density.

### Experimental results:

- On CIFAR-100, the Swin Transformer with Dense DCT input improves accuracy by 4.80% over RGB input.
- On Flowers-102, the Swin Transformer with Dense DCT input improves accuracy by 17.05% over RGB input.
- Similar gains were observed on the other datasets, indicating that the method is effective on small-scale datasets.

### Conclusion:

By explicitly increasing the input information density, the proposed method improves the performance of vision Transformers on small-scale datasets. It is simple, easy to implement, and applicable to other tasks, and the smaller input also reduces the communication cost between GPU and CPU.
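The Block DCT and low-frequency selection steps from the method overview can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the 8x8 block size, the JPEG-style YCbCr coefficients, the function names, and the `keep` parameter are all assumptions made for illustration. In particular, the paper ranks channels with GradCAM heatmaps, whereas this sketch substitutes a fixed low-to-high frequency ordering, which needs no trained model.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix of size n x n (rows are frequencies)."""
    k = np.arange(n)[:, None]  # frequency index
    i = np.arange(n)[None, :]  # sample index
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)  # DC row has a different scale
    return mat

def rgb_to_ycbcr(img: np.ndarray) -> np.ndarray:
    """JPEG-style RGB -> YCbCr for a float image in [0, 255] (assumed convention)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)

def block_dct_channels(img: np.ndarray, block: int = 8) -> np.ndarray:
    """Block DCT: (H, W, C) -> (H/block, W/block, C * block * block).

    Each output channel gathers one frequency of one colour component
    across all blocks, so the spatial size shrinks by `block` per side.
    """
    h, w, c = img.shape
    assert h % block == 0 and w % block == 0, "image must tile into blocks"
    d = dct_matrix(block)
    # Split into blocks: axes are (block row, row in block, block col, col in block, C).
    x = img.reshape(h // block, block, w // block, block, c)
    # 2-D DCT of every block, i.e. D @ X @ D.T along the two in-block axes.
    coeffs = np.einsum("ui,hiwjc,vj->hwcuv", d, x, d)
    return coeffs.reshape(h // block, w // block, c * block * block)

def low_frequency_mask(block: int = 8, keep: int = 16) -> np.ndarray:
    """Boolean mask over block*block frequencies keeping the `keep` lowest.

    Frequencies are ordered by u + v (a zigzag-like ordering); this is a
    stand-in for the heatmap-based ranking described in the summary.
    """
    u, v = np.meshgrid(np.arange(block), np.arange(block), indexing="ij")
    order = np.argsort((u + v).ravel(), kind="stable")
    mask = np.zeros(block * block, dtype=bool)
    mask[order[:keep]] = True
    return mask

def select_channels(channels: np.ndarray, block: int = 8, keep: int = 16) -> np.ndarray:
    """Keep the same low-frequency subset for every colour component."""
    n_comp = channels.shape[-1] // (block * block)
    mask = np.tile(low_frequency_mask(block, keep), n_comp)
    return channels[..., mask]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.uniform(0.0, 255.0, size=(32, 32, 3))  # CIFAR-sized dummy input
    freq = block_dct_channels(rgb_to_ycbcr(image))
    dense = select_channels(freq, keep=16)
    print(freq.shape, dense.shape)  # (4, 4, 192) (4, 4, 48)
```

Because the DCT matrix is orthonormal, the full 192-channel representation carries exactly the same energy as the input. Dropping channels is the only lossy step, which is why the choice of which channels to keep matters.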

Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers

Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets

GhostViT: Expediting Vision Transformers Via Cheap Operations

Lightweight Vision Transformer for Small Data Sets

ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

Vision Transformers for Dense Prediction

Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

Locality Guidance for Improving Vision Transformers on Tiny Datasets.

Vision Transformers in 2022: An Update on Tiny ImageNet

DAT++: Spatially Dynamic Vision Transformer with Deformable Attention

Not All Images Are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length

Vision Transformers: From Semantic Segmentation to Dense Prediction

A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation

Improving Vision Transformers by Revisiting High-Frequency Components

DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition

DctViT: Discrete Cosine Transform Meet Vision Transformers

Vicinity Vision Transformer

FDViT: Improve the Hierarchical Architecture of Vision Transformer.

DeepViT: Towards Deeper Vision Transformer