Scene Segmentation With Dual Relation-Aware Attention Network

Jun Fu, Jing Liu, Jie Jiang, Yong Li, Yongjun Bao, Hanqing Lu
DOI: https://doi.org/10.1109/tnnls.2020.3006524
IF: 14.255
2021-06-01
IEEE Transactions on Neural Networks and Learning Systems
Abstract: In this article, we propose a Dual Relation-aware Attention Network (DRANet) to handle the task of scene segmentation. How to efficiently exploit context is essential for pixel-level recognition. To address the issue, we adaptively capture contextual information based on the relation-aware attention mechanism. Specifically, we append two types of attention modules on top of the dilated fully convolutional network (FCN), which model the contextual dependencies in the spatial and channel dimensions, respectively. In the attention modules, we adopt a self-attention mechanism to model semantic associations between any two pixels or channels. Each pixel or channel can adaptively aggregate context from all pixels or channels according to their correlations. To reduce the high computation and memory cost of this pairwise association computation, we further design two types of compact attention modules. In the compact attention modules, each pixel or channel is associated with only a small number of gathering centers and aggregates its context over these gathering centers. Meanwhile, we add a cross-level gating decoder to selectively enhance spatial details that boost the performance of the network. We conduct extensive experiments to validate the effectiveness of our network and achieve new state-of-the-art segmentation performance on four challenging scene segmentation data sets, i.e., Cityscapes, ADE20K, PASCAL Context, and COCO Stuff. In particular, a Mean IoU score of 82.9% on the Cityscapes test set is achieved without using extra coarse annotated data.
computer science, artificial intelligence, theory & methods, engineering, electrical & electronic, hardware & architecture
What problem does this paper attempt to address?
The paper addresses how to exploit contextual information efficiently in scene segmentation. Scene segmentation assigns every pixel in an image a semantic category, covering both objects (such as people, cars, and bicycles) and background regions (such as sky, road, and grass). Because objects and backgrounds vary widely in scale, occlusion, and lighting, pixel-level recognition is highly challenging. To address this, the authors propose the Dual Relation-aware Attention Network (DRANet), whose relation-aware attention mechanism adaptively captures contextual information in both the spatial and channel dimensions.

The main contributions are as follows:

1. **DRANet**: A network that achieves accurate pixel-level recognition by modeling contextual dependencies in the spatial and channel dimensions.
2. **Compact attention modules**: Two compact attention modules (CPAM and CCAM) that enhance performance while reducing the computational and memory overhead of pairwise attention.
3. **Simple decoder structure**: A decoder with a cross-level gating mechanism that enhances low-level features and highlights spatial details for more accurate predictions.
4. **Experimental validation**: Extensive experiments on four challenging scene segmentation datasets (Cityscapes, ADE20K, PASCAL Context, and COCO Stuff), achieving new state-of-the-art performance.

Together, these designs let DRANet capture rich contextual information at reduced cost. The abstract reports a Mean IoU of 82.9% on the Cityscapes test set without extra coarse annotated data.
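
The difference between pairwise attention and attention over gathering centers, as described in the abstract, can be illustrated with a small NumPy sketch. This is not the authors' implementation, and it rests on several assumptions made only for illustration: the learned projections and residual connections of a real attention module are omitted, correlations are plain dot products, and the gathering centers are formed by average pooling over a coarse grid (the summary above does not say how the centers are built). All function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_attention(feat):
    """Pairwise spatial attention: every pixel attends to every pixel.

    feat has shape (C, H, W). The attention map has shape (N, N), N = H * W.
    """
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)              # (C, N)
    attn = softmax(x.T @ x, axis=-1)        # (N, N), each row sums to 1
    context = x @ attn.T                    # column i = sum_j attn[i, j] * x[:, j]
    return context.reshape(c, h, w), attn


def channel_attention(feat):
    """Pairwise channel attention: every channel attends to every channel."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)              # (C, N)
    attn = softmax(x @ x.T, axis=-1)        # (C, C), each row sums to 1
    context = attn @ x                      # row i = sum_j attn[i, j] * x[j]
    return context.reshape(c, h, w), attn


def compact_spatial_attention(feat, grid=2):
    """Spatial attention over M = grid * grid gathering centers.

    Assumption: centers are average-pooled grid cells, and H and W are
    divisible by grid. The attention map shrinks from (N, N) to (N, M).
    """
    c, h, w = feat.shape
    cells = feat.reshape(c, grid, h // grid, grid, w // grid)
    centers = cells.mean(axis=(2, 4)).reshape(c, grid * grid)   # (C, M)
    x = feat.reshape(c, h * w)                                  # (C, N)
    attn = softmax(x.T @ centers, axis=-1)                      # (N, M)
    context = centers @ attn.T                                  # (C, N)
    return context.reshape(c, h, w), attn


feat = np.random.default_rng(0).standard_normal((8, 6, 6))
_, full = spatial_attention(feat)
_, compact = compact_spatial_attention(feat, grid=2)
print(full.shape, compact.shape)  # (36, 36) (36, 4)
```

For a 6 x 6 feature map the pairwise map holds 36 x 36 weights, while the compact variant holds 36 x 4. The pairwise map grows quadratically with the number of pixels, whereas the compact map grows linearly for a fixed number of centers, which is the source of the cost reduction the abstract describes.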