MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer

Qihao Zhao, Yangyu Huang, Wei Hu, Fan Zhang, Jun Liu

2023-08-07

Abstract: The recently proposed data augmentation TransMix employs attention labels to help vision transformers (ViTs) achieve better robustness and performance. However, TransMix is deficient in two aspects: 1) The image cropping method of TransMix may not be suitable for ViTs. 2) At the early stage of training, the model produces unreliable attention maps, and TransMix uses these unreliable maps to compute mixed attention labels, which can harm the model. To address these issues, we propose MaskMix and Progressive Attention Labeling (PAL) in image and label space, respectively. From the perspective of image space, we design MaskMix, which mixes two images based on a patch-like grid mask. The size of each mask patch is adjustable and is a multiple of the image patch size, which ensures that each image patch comes from only one image and contains more global content. From the perspective of label space, we design PAL, which uses a progressive factor to dynamically re-weight the attention weights of the mixed attention label. Finally, we combine MaskMix and Progressive Attention Labeling into a new data augmentation method, named MixPro. The experimental results show that our method can improve ViT-based models at various scales on ImageNet classification (73.8% top-1 accuracy based on DeiT-T for 300 epochs). After being pre-trained with MixPro on ImageNet, the ViT-based models also demonstrate better transferability to semantic segmentation, object detection, and instance segmentation. Furthermore, compared to TransMix, MixPro also shows stronger robustness on several benchmarks. The code is available at https://github.com/fistyee/MixPro.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper primarily addresses the following issues:

1. **Image Cropping Methods Unsuitable for ViT**:
   - The image cropping methods used by TransMix may not be suitable for Vision Transformers (ViT) because ViT has a global receptive field, and region-based mixed images may not provide sufficient image content.
2. **Unreliable Attention Maps in Early Training**:
   - In the early stages of training, the attention maps generated by the model are unreliable. TransMix uses these unreliable attention maps to compute mixed attention labels, which can affect the model's performance.

To address these issues, the authors propose two new methods:

- **MaskMix**: A grid mask-based method designed in the image space, which mixes two images according to a grid mask. Each mask block size is a multiple of the image patch size, ensuring that each image patch comes from only one image and contains more global information.
- **Progressive Attention Labeling (PAL)**: A progressive factor (α) designed in the label space, used to dynamically reweight the attention weights of mixed attention labels, thereby better adjusting the reliability of attention maps during training.

By combining these two methods, the authors propose a new data augmentation method, MixPro, to improve the generalization ability and robustness of ViT-based models. Experimental results show that MixPro significantly enhances the performance of various ViT-based models on the ImageNet classification task and also improves the performance on downstream tasks (such as semantic segmentation, object detection, and instance segmentation) after pre-training. Additionally, compared to TransMix, MixPro demonstrates stronger robustness across multiple benchmarks.
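The two components described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The function names, the `scale` and `ratio` parameters, and the linear blend between an area-based and an attention-based mixing ratio are all hypothetical choices made for this sketch. How the progressive factor `alpha` is computed is not specified in the summary, so it is simply passed in as a number in [0, 1].

```python
import numpy as np


def maskmix_mask(img_size, patch_size, scale, ratio, rng):
    """Binary grid mask: 1 = pixel taken from image A, 0 = from image B.

    Assumption: each mask cell is `scale` image patches wide, so every
    image patch lies entirely inside one cell (and thus one source image).
    """
    cell = patch_size * scale
    if img_size % cell != 0:
        raise ValueError("img_size must be divisible by patch_size * scale")
    n = img_size // cell
    # Assign each grid cell to image A with probability `ratio`.
    grid = (rng.random((n, n)) < ratio).astype(np.float32)
    # Expand every grid cell to a cell x cell block of pixels.
    return np.kron(grid, np.ones((cell, cell), dtype=np.float32))


def mix_images(img_a, img_b, mask):
    """Mix two (H, W, C) images using an (H, W) binary mask."""
    m = mask[..., None]  # broadcast the mask over the channel axis
    return m * img_a + (1.0 - m) * img_b


def progressive_lambda(mask, attn, patch_size, alpha):
    """Label weight for image A as a blend of two mixing ratios.

    Assumption: alpha weights the attention-based ratio and (1 - alpha)
    weights the area-based ratio. `attn` holds one non-negative attention
    score per image patch, in row-major order.
    """
    lam_area = float(mask.mean())
    # One mask value per image patch (each patch is uniform by construction).
    patch_mask = mask[::patch_size, ::patch_size].reshape(-1)
    attn = attn / attn.sum()
    lam_attn = float((attn * patch_mask).sum())
    return alpha * lam_attn + (1.0 - alpha) * lam_area


# Usage: mix two random 224x224 images with 32x32 mask cells.
rng = np.random.default_rng(0)
mask = maskmix_mask(img_size=224, patch_size=16, scale=2, ratio=0.5, rng=rng)
img_a = rng.random((224, 224, 3))
img_b = rng.random((224, 224, 3))
mixed = mix_images(img_a, img_b, mask)
attn = rng.random(196)  # stand-in for a class-token attention map over 14x14 patches
lam = progressive_lambda(mask, attn, patch_size=16, alpha=0.3)
# Mixed label would then be: lam * label_a + (1 - lam) * label_b
```

With a small `alpha`, the label stays close to the area ratio of the mask. A larger `alpha` lets the attention map dominate, which matches the idea of trusting attention more once it becomes reliable.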

TransMix: Attend to Mix for Vision Transformers

MixFormer: a Mixed CNN-Transformer Backbone for Medical Image Segmentation

SMMix: Self-Motivated Image Mixing for Vision Transformers

TransformMix: Learning Transformation and Mixing Strategies from Data

AMPLIFY: Attention-based Mixup for Performance Improvement and Label Smoothing in Transformer

ITMix: Image-Text Mix Augmentation for Transferring CLIP to Image Classification

AMixer: Adaptive Weight Mixing for Self-attention Free Vision Transformers

DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models

MixGen: A New Multi-Modal Data Augmentation

Advancing Vision Transformers with Group-Mix Attention

VideoMix: Rethinking Data Augmentation for Video Classification

AttentionMix: Data augmentation method that relies on BERT attention mechanism

Augmenting Prototype Network with TransMix for Few-shot Hyperspectral Image Classification

Data Augmentation Vision Transformer for Fine-grained Image Classification

MiAMix: Enhancing Image Classification through a Multi-stage Augmented Mixed Sample Data Augmentation Method

TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training

SnapMix: Semantically Proportional Mixing for Augmenting Fine-grained Data

MixPro: Simple yet Effective Data Augmentation for Prompt-based Learning

Inter-Instance Similarity Modeling for Contrastive Learning