Camouflaged Object Segmentation with Transformer

H. Wang, X. Wang, F. Sun, Y. Song
DOI: https://doi.org/10.1007/978-981-16-9247-5_17
2021-01-01
Abstract: The Vision Transformer (ViT) [6] applies a Transformer architecture directly to image classification and achieves impressive results compared with convolutional networks. This paper presents a new ViT-based camouflaged object segmentation method, called COS Transformer, which aims to identify and segment objects concealed in a complex environment. The high intrinsic similarity between an object and its surroundings makes the task more challenging than salient object detection. Most recent camouflaged object segmentation methods (e.g., EGNet [29], PraNet [10] and SINet [9]) adopt a convolutional network with an encoder-decoder architecture and focus on increasing the receptive field, which is limited by the depth of the network. In the camouflaged object segmentation (COS) task, camouflage relies mainly on contrast with the whole surroundings rather than on local information. We therefore introduce a Transformer with global context awareness: self-attention allows COS Transformer to aggregate features globally even in the lowest layers. Specifically, the architecture is composed of a Transformer-based encoder and a multi-layer feature aggregation refinement module. After training on the COD10K [9] dataset, COS Transformer attains excellent results compared with state-of-the-art convolutional networks, e.g., an 11.7% improvement in $E_\phi$ [8] on COD10K compared with SINet.
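The claim that self-attention aggregates features globally even in the lowest layers can be illustrated with a minimal sketch. This is not the COS Transformer implementation: the image size, patch size, embedding dimension, and random projection weights below are all assumptions chosen for illustration, and the helper names `patchify` and `self_attention` are hypothetical. The sketch only shows that, after a single attention layer, every patch token is a weighted mixture of all patches in the image, whereas a convolution would see only a local neighbourhood.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over patch tokens."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights


# Illustrative sizes only (assumptions, not values from the paper).
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
tokens = patchify(image, patch=8)  # 16 patch tokens, each of length 192
dim = 16
wq, wk, wv = (rng.normal(scale=0.1, size=(tokens.shape[1], dim)) for _ in range(3))

out, weights = self_attention(tokens, wq, wk, wv)
# Each row of `weights` covers all 16 patches, so each output token
# mixes information from the whole image after just one layer.
```

Because each row of `weights` is a softmax over every patch, no entry is exactly zero, so the receptive field of each token is the full image from the first layer onward, without needing a deep stack of layers.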