Feature boosting with efficient attention for scene parsing

Vivek Singh, Shailza Sharma, Fabio Cuzzolin
2024-02-29
Abstract: The complexity of scene parsing grows with the number of object and scene classes, which is higher in unrestricted open scenes. The biggest challenge is to model the spatial relation between scene elements while succeeding in identifying objects at smaller scales. This paper presents a novel feature-boosting network that gathers spatial context from multiple levels of feature extraction and computes the attention weights for each level of representation to generate the final class labels. A novel "channel attention module" is designed to compute the attention weights, ensuring that features from the relevant extraction stages are boosted while the others are attenuated. The model also learns spatial context information at low resolution to preserve the abstract spatial relationships among scene elements and reduce computation cost. Spatial attention is subsequently concatenated into a final feature set before applying feature boosting. Low-resolution spatial attention features are trained using an auxiliary task that helps the model learn a coarse global scene structure. The proposed model outperforms all state-of-the-art models on both the ADE20K and the Cityscapes datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses scene parsing in complex, unconstrained natural scenes: recognizing many object categories at different scales while modeling the spatial relationships among them. It proposes a novel Feature Boosting Network (FBNet) that captures rich contextual information through multi-level feature extraction and improves performance through two attention mechanisms, the Channel Attention Module (CAM) and the Spatial Attention Module (SAM). The main contributions of FBNet are:
1. A new feature extraction method that learns and uses multi-scale spatial context for scene parsing.
2. A novel channel attention module that learns the individual contribution of each feature to the final semantic labels, together with a simplified spatial attention module that efficiently extracts the relevant attention matrices.
3. A new learning mechanism that uses low-resolution semantic maps as an auxiliary task to improve the training of the spatial attention module.
4. Significantly better performance than existing techniques on the ADE20K and Cityscapes benchmark datasets, with a lower parameter count.
Through these methods, FBNet achieves better performance on complex scenes, especially those with multiple object categories and objects of different scales.
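To make the idea of boosting some channels and attenuating others concrete, here is a minimal sketch of a generic squeeze-and-gate channel attention over features concatenated from several extraction levels. It is an illustration only and not the paper's CAM, whose exact design is not given in this summary. The function name `channel_attention`, the two-layer gating, the weight matrices `w1` and `w2`, and the toy feature sizes are all assumptions introduced for this example.

```python
import math
import random


def channel_attention(levels, w1, w2):
    """Hypothetical squeeze-and-gate channel attention (not the paper's CAM).

    levels: list of feature levels; each level is a list of channels, and
            each channel is a flat list of spatial values.
    w1:     hidden x C weight matrix (list of rows), assumed for illustration.
    w2:     C x hidden weight matrix (list of rows), assumed for illustration.
    Returns the reweighted channels and the per-channel weights in (0, 1).
    """
    # Concatenate channels from all extraction levels into one feature set.
    channels = [ch for level in levels for ch in level]

    # Squeeze: global average over the spatial positions of each channel.
    squeezed = [sum(ch) / len(ch) for ch in channels]

    # Gate: linear layer with ReLU, then linear layer with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    weights = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]

    # Scale every channel by its weight: larger weights keep more of the
    # channel, smaller weights attenuate it.
    boosted = [[weights[i] * v for v in ch] for i, ch in enumerate(channels)]
    return boosted, weights


# Toy input: 3 levels, 2 channels each, 4 spatial positions per channel.
random.seed(0)
levels = [
    [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2)]
    for _ in range(3)
]
num_channels, hidden_size = 6, 2
w1 = [[random.gauss(0.0, 0.1) for _ in range(num_channels)] for _ in range(hidden_size)]
w2 = [[random.gauss(0.0, 0.1) for _ in range(hidden_size)] for _ in range(num_channels)]

boosted, weights = channel_attention(levels, w1, w2)
print(len(boosted), len(weights))
```

In a trained network the gating weights would be learned jointly with the rest of the model; here they are random, so the resulting per-channel weights carry no meaning beyond showing the mechanics.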