Asymmetric Vision Transformers for Multi-Label Classification

Jie Liu, Yanqi Bao, Jie Wang, Ke Chen, Lei Zhang
DOI: https://doi.org/10.2139/ssrn.4202302
2022-01-01
Abstract: Multi-label image classification (MLIC) aims to recognize multiple objects within a single image, which is challenging because different objects co-occur and backgrounds are cluttered. Most existing works build their models on label dependencies and/or relationships among object-specific regions. However, these methods usually fail to fully capture the fine-grained details across different regions. Inspired by the success of the vision transformer (ViT), we propose an asymmetric vision transformer for the MLIC task, which optimizes the embedding with an asymmetric loss so that it fuses global and local information. Local information, which stems from the attention module of the transformer, captures label-related object regions, and these semantically discriminative regions help the model focus on the details of each individual object. Long-range global information, in contrast, spans multiple regions and depicts the overall relations among multiple objects. Specifically, by introducing an asymmetric loss function to optimize the global and local embedding matrices, our system can balance the probabilities of positive and negative samples, which is especially important for multi-label classification. Extensive experiments on three multi-label classification datasets (VOC2007, VOC2012, and MS-COCO) demonstrate the superiority of our approach over other state-of-the-art methods.
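The abstract does not give the exact form of the asymmetric loss, so the following is only a minimal sketch of one commonly used formulation: positives and negatives get separate focusing exponents, and a probability margin discards very easy negatives. The function name `asymmetric_loss`, the parameters `gamma_pos`, `gamma_neg`, and `margin`, and their default values are illustrative assumptions, not the authors' implementation.

```python
import math

def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Sum of per-label asymmetric losses for one image (illustrative sketch).

    probs   -- predicted probability for each label, in [0, 1]
    targets -- 1 if the label is present in the image, 0 otherwise
    All hyperparameter names and defaults are assumptions for illustration.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive label: focal-style term with a small (or zero) exponent,
            # so positives keep most of their gradient.
            total -= (1.0 - p) ** gamma_pos * math.log(max(p, eps))
        else:
            # Negative label: shift the probability down by the margin so that
            # very easy negatives contribute exactly zero, then down-weight the
            # remaining ones with a larger exponent.
            p_m = max(p - margin, 0.0)
            total -= p_m ** gamma_neg * math.log(max(1.0 - p_m, eps))
    return total

# One image with three labels: one present, two absent.
loss = asymmetric_loss(probs=[0.9, 0.04, 0.6], targets=[1, 0, 0])
print(loss)
```

Because a typical image contains only a few of the possible labels, negatives vastly outnumber positives; treating the two sides differently in this way is one means of keeping the many easy negatives from dominating training.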