When CNN meet with ViT: decision-level feature fusion for camouflaged object detection

Yue, Guowen; Jiao, Ge; Li, Chen; Xiang, Jiahao
DOI: https://doi.org/10.1007/s00371-024-03640-8
IF: 2.835
2024-09-27
The Visual Computer
Abstract: Despite the significant advances in camouflaged object detection achieved by convolutional neural network (CNN) methods and vision transformer (ViT) methods, both have limitations. CNN-based methods fail to capture long-range dependencies because of their limited receptive fields, while ViT-based methods lose detailed information because of large-span aggregation. To address these issues, we introduce a novel model, the double-extraction and triple-fusion network (DTNet), which combines the global context modeling capability of ViT-based encoders with the detail-capturing capability of CNN-based encoders through decision-level feature fusion, so that each compensates for the other's shortcomings and camouflaged objects are segmented more completely. Specifically, DTNet incorporates a boundary guidance module, which aggregates high-level and low-level boundary information through multi-scale feature decoding and thereby guides the local detail representation of the transformer. It also includes a global context aggregation module, which shrinks the information of adjacent channels from top to bottom and aggregates information from high-level and low-level scales from bottom to top for feature decoding. Finally, it contains a multi-feature fusion module that fuses global context features with local detail features; this module applies an attention mechanism across channels to assign different weights to long-range and short-range information. Extensive experiments show that DTNet significantly outperforms 20 recent state-of-the-art methods. The related code and datasets will be posted at https://github.com/KungFuProgrammerle/DTNet.
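The abstract's idea of weighting long-range and short-range information per channel can be illustrated with a minimal NumPy sketch. This is not the DTNet multi-feature fusion module, whose details the abstract does not give. It is a generic gated channel-attention fusion under assumed conventions: both branches produce feature maps of the same shape (C, H, W), and the function name `channel_attention_fusion` and the gate matrix `w_gate` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention_fusion(global_feat, local_feat, w_gate):
    """Fuse two (C, H, W) feature maps with per-channel weights.

    global_feat: long-range (ViT-style) features, shape (C, H, W)
    local_feat:  short-range (CNN-style) features, shape (C, H, W)
    w_gate:      hypothetical gate weights, shape (C, 2C); in a trained
                 network these would be learned, here they are just inputs
    """
    # Global average pooling: one scalar descriptor per channel and branch.
    desc = np.concatenate(
        [global_feat.mean(axis=(1, 2)), local_feat.mean(axis=(1, 2))]
    )  # shape (2C,)

    # Per-channel weight in (0, 1) deciding how much global context to keep.
    alpha = sigmoid(w_gate @ desc)[:, None, None]  # shape (C, 1, 1)

    # Convex combination of the two branches, channel by channel.
    return alpha * global_feat + (1.0 - alpha) * local_feat


# Usage with random stand-in features (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
vit_feat = rng.standard_normal((4, 8, 8))
cnn_feat = rng.standard_normal((4, 8, 8))
fused = channel_attention_fusion(vit_feat, cnn_feat, rng.standard_normal((4, 8)))
```

Because the gate yields one weight per channel, every fused value lies between the corresponding values of the two branches; a channel whose weight is near 1 follows the global branch and one whose weight is near 0 follows the local branch.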
computer science, software engineering
What problem does this paper attempt to address?