BiCSNet: A Bidirectional Cross-Scale Backbone for Recognition and Localization

Song Peng, Zhenfeng Shao, Xiao Huang, Yi Zhu, Ruiqian Zhang, Junwei Zha
DOI: https://doi.org/10.1109/tcsvt.2021.3138743
IF: 5.859
2021-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Recognition and localization models can generally be decomposed into three components: encoder, decoder, and task head. In this paper, we rethink the necessity of the decoder, as we observe that it adds computational and parametric burden. We therefore propose to remove the decoder and present a bidirectional cross-scale architecture that obtains rich semantic information and precise localization in a unified backbone. Extensive experiments demonstrate that, unlike common encoder-decoder models and other down-sampling and up-sampling backbones, the proposed BiCSNet achieves improved performance over existing architectures on pixel-level tasks. In object detection, BiCSNet improves AP by about 3% at various scales with 13%–23% fewer FLOPs, compared with ResNet-FPN models on the COCO dataset. In instance segmentation, AP is improved by 1% over SpineNet. BiCSNet is also promising for semantic segmentation: pre-trained on ImageNet alone, it outperforms DeepLabv3 pre-trained on both ImageNet and COCO by 1.3% in mIoU with 89% fewer FLOPs on PASCAL VOC 2012.