Improving Visual Representations of Masked Autoencoders With Artifacts Suppression

Zhengwei Miao, Hui Luo, Dongxu Liu, Jianlin Zhang
DOI: https://doi.org/10.1109/lsp.2024.3458792
2024-10-04
IEEE Signal Processing Letters
Abstract: Recently, Masked Autoencoders (MAE) have gained attention for their ability to generate visual representations efficiently through pretext tasks. However, little research has evaluated the visual representations obtained by a pre-trained MAE during fine-tuning. In this study, we address this gap by examining the attention maps within each block of the pre-trained MAE during fine-tuning. We observe artifacts in pre-trained models, which appear as significant responses in the attention maps of shallow blocks. These artifacts may negatively impact the transfer performance of MAE. We trace the cause of these artifacts to the asymmetry between the pre-training and fine-tuning processes. To suppress them, we propose a novel semantic masking strategy that aims to preserve complete and continuous semantic information within visible patches while maintaining randomness to facilitate robust representation learning. Experimental results demonstrate that the proposed masking strategy improves performance on various downstream tasks while reducing artifacts. Specifically, we observe a 3.2% improvement in linear probing, a 0.5% improvement in fine-tuning on ImageNet-1K, and a 0.6% increase in semantic segmentation on ADE20K.
engineering, electrical & electronic