Abstract: The recently proposed data augmentation method TransMix employs attention labels to help vision transformers (ViTs) achieve better robustness and performance. However, TransMix is deficient in two aspects: 1) The image cropping method of TransMix may not be suitable for ViTs. 2) At the early stage of training, the model produces unreliable attention maps, and TransMix uses these unreliable maps to compute mixed attention labels, which can adversely affect the model. To address these issues, we propose MaskMix and Progressive Attention Labeling (PAL) in image and label space, respectively. From the perspective of image space, we design MaskMix, which mixes two images based on a patch-like grid mask. The size of each mask patch is adjustable and is a multiple of the image patch size, which ensures that each image patch comes from only one image and contains more global content. From the perspective of label space, we design PAL, which utilizes a progressive factor to dynamically re-weight the attention weights of the mixed attention label. Finally, we combine MaskMix and PAL into a new data augmentation method, named MixPro. The experimental results show that our method can improve ViT-based models at various scales on ImageNet classification (73.8% top-1 accuracy based on DeiT-T for 300 epochs). After being pre-trained with MixPro on ImageNet, the ViT-based models also demonstrate better transferability to semantic segmentation, object detection, and instance segmentation. Furthermore, compared to TransMix, MixPro also shows stronger robustness on several benchmarks. The code is available at https://github.com/fistyee/MixPro.
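
To make the grid-mask idea concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the repository linked above for that). The function names `maskmix` and `blend_labels`, the parameters `image_patch`, `scale`, and `mix_prob`, and the independent per-cell Bernoulli sampling are all assumptions made for illustration. The abstract does not state how the progressive factor is computed, so `blend_labels` simply takes a hypothetical factor `alpha` as input and interpolates between an area-based label and an attention-based label.

```python
import numpy as np


def maskmix(img_a, img_b, image_patch=16, scale=2, mix_prob=0.5, rng=None):
    """Mix two HxWxC images with a grid mask (illustrative sketch only).

    Each mask cell spans `scale` image patches per side, so every
    `image_patch` x `image_patch` patch is taken from exactly one image.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = img_a.shape[:2]
    cell = image_patch * scale  # mask patch size: a multiple of the image patch size
    if img_a.shape != img_b.shape or h % cell or w % cell:
        raise ValueError("images must share a shape divisible by the mask cell size")

    # Assumption: one independent Bernoulli draw per mask cell.
    grid = rng.random((h // cell, w // cell)) < mix_prob
    # Upsample the coarse grid to pixel resolution.
    mask = grid.repeat(cell, axis=0).repeat(cell, axis=1)

    mixed = np.where(mask[..., None], img_a, img_b)
    lam = float(mask.mean())  # area fraction taken from img_a
    return mixed, mask, lam


def blend_labels(area_label, attention_label, alpha):
    """Hypothetical re-weighting: alpha=0 trusts area only, alpha=1 attention only."""
    return alpha * attention_label + (1.0 - alpha) * area_label


if __name__ == "__main__":
    a = np.zeros((224, 224, 3))
    b = np.ones((224, 224, 3))
    mixed, mask, lam = maskmix(a, b, rng=np.random.default_rng(0))
    area = lam * np.array([1.0, 0.0]) + (1.0 - lam) * np.array([0.0, 1.0])
    print(mixed.shape, round(lam, 3), blend_labels(area, np.array([0.7, 0.3]), 0.1))
```

In this sketch a small `alpha` early in training would keep the label close to the area ratio, and a larger `alpha` later would lean more on the attention-derived label; how the factor actually evolves is defined by the paper, not by this example.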

ITMix: Image-Text Mix Augmentation for Transferring CLIP to Image Classification

TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training

TransMix: Attend to Mix for Vision Transformers

TransformMix: Learning Transformation and Mixing Strategies from Data

SMMix: Self-Motivated Image Mixing for Vision Transformers

MixGen: A New Multi-Modal Data Augmentation

DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification

MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer

MiAMix: Enhancing Image Classification through a Multi-stage Augmented Mixed Sample Data Augmentation Method

MixIR: Mixing Input and Representations for Contrastive Learning

PatchMix: patch-level mixup for data augmentation in convolutional neural networks

MixCut: A Data Augmentation Method for Facial Expression Recognition

Text-Guided Mixup Towards Long-Tailed Image Categorization

CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features

SUMix: Mixup with Semantic and Uncertain Information

Geodesic Multi-Modal Mixup for Robust Fine-Tuning

AttentionMix: Data augmentation method that relies on BERT attention mechanism

Mixup Augmentation with Multiple Interpolations

Inter-Instance Similarity Modeling for Contrastive Learning

OpenMixup: A Comprehensive Mixup Benchmark for Visual Classification

AutoMix: Unveiling the Power of Mixup for Stronger Classifiers