Proxy Probing Decoder for Weakly Supervised Object Localization: A Baseline Investigation

Jingyuan Xu, Hongtao Xie, Chuanbin Liu, Yongdong Zhang
DOI: https://doi.org/10.1145/3503161.3547945
2022-01-01
Abstract: Weakly supervised object localization (WSOL) aims to localize objects using only image-level category labels. Existing methods generally fine-tune models with manually selected training epochs and subjective loss functions to mitigate the partial activation problem of classification-based models. However, such a fine-tuning scheme can degrade the model, e.g., by harming the classification performance and generalization capability of the pre-trained model. In this paper, we propose a novel method named Proxy Probing Decoder (PPD) to meet these challenges. It exploits the segmentation property of the self-attention maps in a self-supervised vision transformer and avoids fine-tuning the model by introducing a novel proxy probing decoder. Specifically, we use the self-supervised vision transformer to capture long-range dependencies and avoid partial activation. We then adopt a proxy consisting of a series of decoding layers to transform the feature representations into a heatmap of the object's foreground, from which localization is performed. The backbone parameters are frozen during training, while the proxy is trained to decode the features and localize the object. In this way, the vision transformer retains its feature representation capability, and only the proxy needs to adapt to the task. Without bells and whistles, our framework achieves 55.0% Top-1 Loc on the ILSVRC2012 dataset and 78.8% Top-1 Loc on the CUB-200-2011 dataset, which surpasses the state of the art by a large margin and provides a simple baseline. Code and models will be available on GitHub.
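As a rough illustration of the last step described in the abstract (turning a foreground heatmap into a localization) and of how Top-1 Loc is commonly scored in WSOL, the following Python sketch may help. It is not the authors' implementation: the function names, the 0.5 heatmap threshold, and the choice of one tight box around all above-threshold pixels (rather than, say, the largest connected component) are assumptions made for illustration. The scoring rule shown, a correct class prediction together with a box IoU of at least 0.5, is the usual convention for Top-1 Loc.

```python
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) in inclusive pixel coordinates
Box = Tuple[int, int, int, int]


def heatmap_to_box(heatmap: List[List[float]], threshold: float = 0.5) -> Optional[Box]:
    """Return the tight box around all pixels scoring at least `threshold`.

    Assumption: a single global threshold and a single box; real pipelines
    often keep only the largest connected foreground region.
    """
    xs, ys = [], []
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score >= threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # no pixel was marked as foreground
    return (min(xs), min(ys), max(xs), max(ys))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two inclusive pixel boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0]) + 1
    inter_h = min(a[3], b[3]) - max(a[1], b[1]) + 1
    inter = max(inter_w, 0) * max(inter_h, 0)
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


def top1_loc_correct(pred_label: int, true_label: int,
                     pred_box: Optional[Box], true_box: Box,
                     iou_threshold: float = 0.5) -> bool:
    """A sample counts toward Top-1 Loc only if class and box are both right."""
    if pred_box is None or pred_label != true_label:
        return False
    return iou(pred_box, true_box) >= iou_threshold


# Toy 4x4 heatmap with a 2x2 foreground blob in the centre.
toy_heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.6, 0.1],
    [0.0, 0.1, 0.2, 0.0],
]
toy_box = heatmap_to_box(toy_heatmap)
```

Here `toy_box` covers the central 2x2 blob. A dataset-level Top-1 Loc figure such as those quoted above would then be the fraction of test images for which `top1_loc_correct` holds.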