Token Boosting for Robust Self-Supervised Visual Transformer Pre-training

Tianjiao Li, Lin Geng Foo, Ping Hu, Xindi Shang, Hossein Rahmani, Zehuan Yuan, Jun Liu
2023-04-12
Abstract: Learning with large-scale unlabeled data has become a powerful tool for pre-training Visual Transformers (VTs). However, prior works tend to overlook that, in real-world scenarios, the input data may be corrupted and unreliable. Pre-training VTs on such corrupted data can be challenging, especially when we pre-train via the masked autoencoding approach, where both the inputs and masked "ground truth" targets can potentially be unreliable in this case. To address this limitation, we introduce the Token Boosting Module (TBM) as a plug-and-play component for VTs that effectively allows the VT to learn to extract clean and robust features during masked autoencoding pre-training. We provide theoretical analysis to show how TBM improves model pre-training with more robust and generalizable representations, thus benefiting downstream tasks. We conduct extensive experiments to analyze TBM's effectiveness, and results on four corrupted datasets demonstrate that TBM consistently improves performance on downstream tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses how to improve the ability of Visual Transformers (VTs) to extract robust features from noisy and unreliable data during self-supervised pre-training. In real-world scenarios, input data is often corrupted or unreliable: for example, images taken in bad weather, depth images with measurement errors, and skeleton data distorted by sensor noise. A VT that is pre-trained on such data in a self-supervised manner can learn unreliable features, which in turn degrades performance on downstream tasks.

### Solutions

To address this challenge, the paper proposes the Token Boosting Module (TBM), which aims to make the VT more robust to unreliable and noisy data during self-supervised pre-training. TBM is a plug-and-play module that can be inserted into multiple layers of a VT and trained end-to-end. The approach has three parts:

1. **Feature enhancement technique**: TBM adds synthetic noise to the input features and then reconstructs the features with an auto-encoder, which yields an estimate of more reliable features.
2. **Theoretical analysis**: The paper provides theoretical analysis showing that TBM can learn cleaner and more robust features during self-supervised pre-training.
3. **Experimental verification**: The paper conducts experiments on multiple tasks, including RGB image classification, 3D skeleton action recognition, and depth image classification. The results show that TBM consistently improves downstream performance when the data is noisy.

### Main contributions

1. **Designed the Token Boosting Module (TBM)**: TBM improves the robustness of the VT to unreliable and noisy data during self-supervised pre-training. Inserting TBM into multiple layers can further enhance the performance of the VT.
2. **Theoretical analysis**: The paper provides theoretical analysis showing that TBM can learn more robust and reliable feature representations through self-supervised pre-training.
3. **Experimental verification**: The paper conducts extensive experiments on multiple tasks to verify the effectiveness of TBM, especially when dealing with noisy data.

### Formula presentation

Some of the key formulas in the paper are as follows:

- **Key formula in the feature enhancement technique**:

  \[ \hat{R} = 2\hat{F}-I \]

  where \(\hat{R}\) is the resulting estimate of the more reliable feature, \(\hat{F}\) is the feature estimate reconstructed from the intermediate representation \(I\) by the auto-encoder \(g\), and \(I\) is the combination of the original feature \(F\) and the synthetic noise \(Q = \alpha\odot S\):

  \[ I = F+(\alpha\odot S) \]

  Here \(\alpha\) is a learnable scaling parameter and \(S\) is a Gaussian noise vector.

- **L2 reconstruction loss**:

  \[ L_{\text{recon}}(F,\hat{F})=\lambda\sum_{k = 1}^{K}[F_k-\hat{F}_k]^2 \]

  where \(\lambda\) is a hyperparameter that controls the loss weight and the sum runs over the \(K\) components of the feature.

### Conclusion

By introducing the Token Boosting Module (TBM), the paper improves the robustness of Visual Transformers to noisy data during self-supervised pre-training. This improves the quality of the pre-trained model and enables the VT to perform better on downstream tasks when the input data is unreliable.
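To make the formulas in the "Formula presentation" section concrete, the arithmetic can be traced with a small, dependency-free sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function names `add_synthetic_noise`, `boost`, and `recon_loss` are hypothetical, features are plain Python lists rather than VT token tensors, and the auto-encoder \(g\) is passed in as an arbitrary callable (the demo uses an untrained identity placeholder). How the reconstruction loss is combined with the masked autoencoding objective is not covered by this summary and is left out.

```python
import random


def add_synthetic_noise(F, alpha, rng):
    """Compute I = F + (alpha ⊙ S), with each S_k drawn from N(0, 1)."""
    S = [rng.gauss(0.0, 1.0) for _ in F]
    return [f + a * s for f, a, s in zip(F, alpha, S)]


def boost(F, alpha, g, rng):
    """Apply the three steps from the summary to one feature vector.

    F     : list of floats, the (possibly unreliable) input feature
    alpha : list of floats, stand-in for the learnable noise scale
    g     : callable standing in for the auto-encoder (assumption)
    rng   : random.Random instance used to sample S
    """
    I = add_synthetic_noise(F, alpha, rng)             # I = F + alpha ⊙ S
    F_hat = g(I)                                       # F_hat = g(I)
    R_hat = [2.0 * fh - i for fh, i in zip(F_hat, I)]  # R_hat = 2 F_hat - I
    return R_hat, F_hat, I


def recon_loss(F, F_hat, lam):
    """L_recon(F, F_hat) = lam * sum_k (F_k - F_hat_k)^2."""
    return lam * sum((f - fh) ** 2 for f, fh in zip(F, F_hat))


if __name__ == "__main__":
    rng = random.Random(0)
    F = [0.5, -1.0, 2.0]
    alpha = [0.1, 0.1, 0.1]

    # Placeholder for g: an identity map, NOT a trained auto-encoder.
    def identity_g(I):
        return list(I)

    R_hat, F_hat, I = boost(F, alpha, identity_g, rng)
    print("I     =", I)
    print("R_hat =", R_hat)
    print("loss  =", recon_loss(F, F_hat, lam=0.5))
```

With the identity placeholder, \(\hat{F} = I\) and therefore \(\hat{R} = I\), so the demo only exercises the arithmetic. Any cleaning effect would have to come from a trained \(g\), which this sketch does not provide.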