SemDM: Task-oriented masking strategy for self-supervised visual learning

Xin Ma, Haonan Cheng, Long Ye
DOI: https://doi.org/10.1016/j.displa.2023.102439
IF: 3.074
2023-01-01
Displays
Abstract: In this paper, we propose a novel learning scheme for better combining masked image modeling (MIM) and instance discrimination (ID). Motivated by the gap between the masking strengths that MIM and ID require, we propose Semantic Disjoint Masking (SemDM), which decomposes masking into two modes: preserving the majority of key patterns in an image for ID, while dropping most of them for MIM. Specifically, we use attention-guided masking in ID to help preserve the identity of the object in the image for the encoder. In MIM, conversely, we leave only a few hints about the object. Each generated masked view is then used only for its designated learning task, so that each task learns more suitable visual priors. Moreover, we introduce product quantization (PQ) to optimize the concept distributions in latent space, which guarantees that a compact set of meaningful visual concepts can be learned. Extensive experiments demonstrate that our method bootstraps meaningful visual concepts to guide visual understanding and obtains state-of-the-art results on ImageNet-100.
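The two masking modes described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `semantic_disjoint_masks`, the parameters `id_mask_ratio`, `mim_mask_ratio`, and `n_hints`, and the rule of ranking patches by a per-patch attention score are all assumptions made for illustration. The sketch only shows the idea that the ID view hides low-attention patches (keeping key patterns visible) while the MIM view hides most high-attention patches (leaving a few hints).

```python
import numpy as np


def semantic_disjoint_masks(attention, id_mask_ratio=0.5,
                            mim_mask_ratio=0.75, n_hints=1):
    """Build two boolean patch masks (True = patch is masked).

    attention: 1-D array of per-patch attention scores, where a higher
    score is assumed to mean a more salient (object-related) patch.
    All parameter names and defaults here are hypothetical.
    """
    attention = np.asarray(attention, dtype=float)
    n = attention.size
    # Patch indices sorted from most to least salient (stable for ties).
    order = np.argsort(-attention, kind="stable")

    # ID view: mask the least salient patches so that the key patterns
    # of the object remain visible to the encoder.
    n_id = int(round(id_mask_ratio * n))
    id_mask = np.zeros(n, dtype=bool)
    id_mask[order[n - n_id:]] = True

    # MIM view: keep the top `n_hints` salient patches visible as hints,
    # then mask the next most salient patches.
    n_mim = min(int(round(mim_mask_ratio * n)), n - n_hints)
    mim_mask = np.zeros(n, dtype=bool)
    mim_mask[order[n_hints:n_hints + n_mim]] = True

    return id_mask, mim_mask


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]
    id_mask, mim_mask = semantic_disjoint_masks(scores)
    print("ID view masked patches: ", np.flatnonzero(id_mask))
    print("MIM view masked patches:", np.flatnonzero(mim_mask))
```

In this sketch each mask would be applied to its own view and passed only to its own objective, matching the abstract's statement that each masked view performs only its specified learning task. The ratios are placeholders, not values reported in the paper.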