Abstract: Learning with large-scale unlabeled data has become a powerful tool for pre-training Visual Transformers (VTs). However, prior works tend to overlook that, in real-world scenarios, the input data may be corrupted and unreliable. Pre-training VTs on such corrupted data can be challenging, especially when we pre-train via the masked autoencoding approach, where both the inputs and masked "ground truth" targets can potentially be unreliable. To address this limitation, we introduce the Token Boosting Module (TBM) as a plug-and-play component for VTs that effectively allows the VT to learn to extract clean and robust features during masked autoencoding pre-training. We provide theoretical analysis to show how TBM improves model pre-training with more robust and generalizable representations, thus benefiting downstream tasks. We conduct extensive experiments to analyze TBM's effectiveness, and results on four corrupted datasets demonstrate that TBM consistently improves performance on downstream tasks.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper asks how Visual Transformers (VTs) can learn to extract robust features from noisy, unreliable data during self-supervised pre-training. In real-world scenarios, input data is often degraded: images taken in bad weather, depth images with measurement errors, and skeleton data distorted by sensor noise. When a VT is pre-trained on such data, the features it learns can be unreliable, which in turn hurts performance on downstream tasks.

### Solutions

To address this challenge, the paper proposes a new module, the Token Boosting Module (TBM), which makes the VT more robust to unreliable and noisy data during self-supervised pre-training. TBM is a plug-and-play module that can be inserted into multiple layers of a VT and trained end to end. The paper supports TBM in three ways:

1. **Feature enhancement technique**: TBM adds synthetic noise to the input features and then reconstructs those features with an autoencoder, which yields an estimate of more reliable features.
2. **Theoretical analysis**: The paper provides theoretical analysis showing that TBM can learn cleaner and more robust features during self-supervised pre-training.
3. **Experimental verification**: The paper conducts experiments on multiple tasks, including RGB image classification, 3D skeleton action recognition, and depth image classification. The results show that TBM consistently improves downstream performance when the data is noisy.

### Main contributions

1. **Designed the Token Boosting Module (TBM)**: TBM improves the robustness of VTs to unreliable and noisy data during self-supervised pre-training. Inserting TBM into multiple layers can further enhance VT performance.
2. **Theoretical analysis**: The paper provides theoretical analysis showing that TBM can learn more robust and reliable feature representations through self-supervised pre-training.
3. **Experimental verification**: The paper conducts extensive experiments on multiple tasks to verify the effectiveness of TBM, especially on noisy data.

### Formula presentation

Some of the key formulas in the paper are as follows:

- **Key formula in the feature enhancement technique**:

  \[ \hat{R} = 2\hat{F} - I \]

  where \(\hat{F}\) is the feature estimate reconstructed from the intermediate representation \(I\) by the autoencoder \(g\), and \(I\) is the sum of the original feature \(F\) and the synthetic noise \(Q = \alpha \odot S\):

  \[ I = F + (\alpha \odot S) \]

  Here \(\alpha\) is a learnable scaling parameter and \(S\) is a Gaussian noise vector.

- **L2 reconstruction loss**:

  \[ L_{\text{recon}}(F, \hat{F}) = \lambda \sum_{k=1}^{K} \left(F_k - \hat{F}_k\right)^2 \]

  where \(\lambda\) is a hyperparameter that controls the loss weight and the sum runs over the \(K\) components of the feature.

### Conclusion

By introducing the Token Boosting Module (TBM), the paper improves the robustness of Visual Transformers to noisy data during self-supervised pre-training. This improves the quality of the pre-trained model and lets the VT perform better on downstream tasks when the input data is unreliable.
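
The formulas listed under "Formula presentation" can be traced step by step in a few lines of NumPy. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function names `token_boost` and `recon_loss` are hypothetical, the feature is treated as a single vector of length K, and the autoencoder \(g\) is passed in as an arbitrary callable (an identity placeholder in the usage lines) rather than a trained network.

```python
import numpy as np

def token_boost(F, alpha, g, rng):
    """One boosting step on a feature vector F (hypothetical helper).

    F     : array of shape (K,), the input feature
    alpha : array of shape (K,), the scaling parameter (learnable in the paper,
            a fixed array in this sketch)
    g     : callable standing in for the autoencoder
    rng   : numpy random Generator used to draw the Gaussian noise S
    """
    S = rng.standard_normal(F.shape)  # Gaussian noise vector S
    I = F + alpha * S                 # I = F + (alpha ⊙ S)
    F_hat = g(I)                      # F_hat: reconstruction of F from I
    R_hat = 2.0 * F_hat - I           # R_hat = 2 * F_hat - I
    return R_hat, F_hat, I

def recon_loss(F, F_hat, lam=1.0):
    """L2 reconstruction loss: lam * sum_k (F_k - F_hat_k)^2."""
    return lam * np.sum((F - F_hat) ** 2)

# Usage with an identity placeholder for g (an assumption for illustration only).
rng = np.random.default_rng(0)
F = rng.standard_normal(8)
alpha = np.full(8, 0.1)
R_hat, F_hat, I = token_boost(F, alpha, lambda x: x, rng)
loss = recon_loss(F, F_hat, lam=0.5)
```

With the identity placeholder, \(\hat{F} = I\), so \(\hat{R}\) simply equals \(I\); the boosted estimate only differs from the noisy intermediate once \(g\) has been trained to reconstruct \(F\).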

Token Boosting for Robust Self-Supervised Visual Transformer Pre-training

MST: Masked Self-Supervised Transformer for Visual Representation

All Tokens Matter: Token Labeling for Training Better Vision Transformers

iBOT: Image BERT Pre-Training with Online Tokenizer

Robustifying Token Attention for Vision Transformers

Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet

SeiT++: Masked Token Modeling Improves Storage-efficient Training

A General and Efficient Training for Transformer via Token Expansion

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Boosting Vanilla Lightweight Vision Transformers Via Re-parameterization

A Closer Look at Self-Supervised Lightweight Vision Transformers

BEVT: BERT Pretraining of Video Transformers

Token Selection is a Simple Booster for Vision Transformers

Morphing Tokens Draw Strong Masked Image Models

Vision Transformer with Super Token Sampling

BEiT: BERT Pre-Training of Image Transformers

TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction

Learning Imbalanced Data with Vision Transformers

So-ViT: Mind Visual Tokens for Vision Transformer

MVP: Multimodality-Guided Visual Pre-training