Kernel Masked Image Modeling Through the Lens of Theoretical Understanding

Yurui Qian, Yu Wang, Jingjing Zou, Jingyang Lin, Yingwei Pan, Ting Yao, Qibin Sun, Tao Mei
DOI: https://doi.org/10.1109/TNNLS.2024.3443088
2024-08-27
Abstract: Masked image modeling (MIM) is widely regarded as the state-of-the-art (SOTA) self-supervised learning (SSL) technique for visual pretraining. The impressive generalization ability of MIM has also paved the way for the remarkable success of large-scale vision foundation models. In this article, we further discuss the validity and advantages of implementing MIM techniques in reproducing kernel Hilbert spaces (RKHSs), and we associate the analysis with a novel MIM method named R-MIM (short for RKHS-MIM). Through the careful construction of an augmentation graph and the use of spectral decomposition techniques, we establish a systematic theoretical link between the generalization ability of the proposed R-MIM and the choice of kernel function used during training. Specifically, we conclude that the local Lipschitz constant of the resulting R-MIM model and the corresponding expected pretraining error together can have a strong composite effect on bounding the downstream task error, depending on the kernel chosen. We demonstrate that, under mild mathematical assumptions, the R-MIM method is guaranteed to yield a lower bound on downstream task error than vanilla MIM techniques such as the masked autoencoder (MAE) and SimMIM. Empirical results corroborate our theoretical hypothesis and analysis, showing both the superior generalization of the proposed R-MIM and the theoretical link to kernel choices. The code is available at: https://github.com/yurui-q/R-MIM.
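As a rough illustration of what measuring masked-patch reconstruction error in an RKHS can look like, the sketch below replaces the usual pixel-space squared error with the squared distance between kernel feature maps. It is not the authors' R-MIM objective: the Gaussian kernel, the bandwidth `sigma`, and the function names `gaussian_kernel` and `rkhs_masked_loss` are all assumptions made for this example. The only fact it relies on is the standard identity ||phi(p) - phi(t)||^2 = k(p, p) + k(t, t) - 2 k(p, t), which reduces to 2 - 2 k(p, t) for a Gaussian kernel.

```python
import numpy as np


def gaussian_kernel(x, y, sigma=1.0):
    """Row-wise Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = np.sum((x - y) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def rkhs_masked_loss(pred, target, mask, sigma=1.0):
    """Mean squared RKHS distance between predicted and target patches.

    Hypothetical helper for illustration only (not the R-MIM loss).
    pred, target: float arrays of shape (num_patches, patch_dim).
    mask: boolean array of shape (num_patches,), True where a patch was masked.
    Assumes at least one patch is masked.
    """
    # For the Gaussian kernel k(x, x) = 1, so the squared feature-space
    # distance simplifies to 2 - 2 * k(pred, target).
    dist = 2.0 - 2.0 * gaussian_kernel(pred, target, sigma)
    # As in standard MIM, only masked patches contribute to the loss.
    return float(dist[mask].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=(8, 16))
    pred = target + 0.1 * rng.normal(size=(8, 16))
    mask = np.array([True, False, True, True, False, False, True, False])
    print(rkhs_masked_loss(pred, target, mask, sigma=2.0))
```

Unlike a pixel-space squared error, this loss is bounded above by 2 for any prediction, and `sigma` controls how quickly it saturates. That gives some intuition for why a kernel choice could influence both the pretraining error and the smoothness of the learned model, although the paper's actual analysis is the authority on that link.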