Abstract: Referring image segmentation has recently attracted wide attention because of its potential in human-robot interaction. A network that identifies the referred region must understand both the image and the language semantics in depth. To this end, existing works design various mechanisms for cross-modality fusion, for example tile-and-concatenation and vanilla non-local operations. However, such plain fusion is usually either coarse or constrained by exorbitant computation overhead, which leaves the network with an insufficient understanding of the referent. In this work, we propose a fine-grained semantic funneling infusion (FSFI) mechanism to solve this problem. The FSFI imposes a constant spatial constraint on the querying entities from different encoding stages and dynamically infuses the gleaned language semantics into the vision branch. Moreover, it decomposes the features from the different modalities into finer components, so that fusion takes place in multiple low-dimensional spaces. This fusion is more effective than fusion in a single high-dimensional space because it can retain more representative information along the channel dimension. Another problem in this task is that instilling highly abstract semantics blurs the details of the referent. To address it, we propose a multiscale attention-enhanced decoder (MAED). We design a detail enhancement operator (DeEh) and apply it in a multiscale, progressive way: features from the higher level generate attention guidance that directs the lower-level features to attend more closely to detail regions. Extensive results on challenging benchmarks show that our network performs favorably against state-of-the-art (SOTA) methods.
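The idea of fusing modalities in multiple low-dimensional spaces rather than one high-dimensional space can be illustrated with a minimal NumPy sketch. It is not the FSFI implementation: the function name `grouped_language_infusion`, the use of scaled dot-product attention, the residual addition, and the absence of learned projections and of the spatial constraint are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_language_infusion(vision, language, num_groups):
    """Hypothetical sketch of fusion in several low-dimensional subspaces.

    vision:   (N, C) array, one C-dimensional feature per pixel.
    language: (T, C) array, one C-dimensional feature per word.
    Returns the fused (N, C) features and the (G, N, T) attention weights.
    """
    n, c = vision.shape
    t, c_lang = language.shape
    if c != c_lang or c % num_groups != 0:
        raise ValueError("channel sizes must match and divide by num_groups")
    d = c // num_groups

    # Split the channel dimension into num_groups subspaces of size d.
    v = vision.reshape(n, num_groups, d).transpose(1, 0, 2)    # (G, N, d)
    w = language.reshape(t, num_groups, d).transpose(1, 0, 2)  # (G, T, d)

    # In each subspace, every pixel queries the words independently.
    scores = v @ w.transpose(0, 2, 1) / np.sqrt(d)             # (G, N, T)
    weights = softmax(scores, axis=-1)

    # Gather language semantics per pixel and add them to the vision branch.
    gathered = weights @ w                                     # (G, N, d)
    fused = v + gathered

    # Merge the subspaces back into the original channel layout.
    return fused.transpose(1, 0, 2).reshape(n, c), weights


rng = np.random.default_rng(0)
vision = rng.normal(size=(6, 8))    # 6 pixels, 8 channels (toy sizes)
language = rng.normal(size=(3, 8))  # 3 words, 8 channels
fused, weights = grouped_language_infusion(vision, language, num_groups=4)
print(fused.shape, weights.shape)
```

Because each group computes its own attention map over the words, different channel groups can attend to different words for the same pixel, which a single attention map over the full channel dimension cannot do.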

Sum-Fusion and Cascaded Interpolation for Semantic Image Segmentation

Research of improving semantic image segmentation based on a feature fusion model

Semantic Segmentation via Highly Fused Convolutional Network with Multiple Soft Cost Functions

Enhanced Multi-Scale Feature Adaptive Fusion Sparse Convolutional Network for Large-Scale Scenes Semantic Segmentation

SDFuse: Semantic-injected Dual-Flow Learning for Infrared and Visible Image Fusion

Multi-layer Adaptive Feature Fusion for Semantic Segmentation

Semantic Image Segmentation with Improved Position Attention and Feature Fusion

Referring Image Segmentation with Fine-Grained Semantic Funneling Infusion

ExFuse: Enhancing Feature Fusion for Semantic Segmentation

CIMFNet: Cross-layer Interaction and Multiscale Fusion Network for Semantic Segmentation of High-Resolution Remote Sensing Images

Semantic-Aware Fusion Network Based on Super-Resolution

A Multi-Step Fusion Network for Semantic Segmentation of High-Resolution Aerial Images

A Multi-level Feature Fusion Network for Real-time Semantic Segmentation

Enhancing Feature Fusion with Spatial Aggregation and Channel Fusion for Semantic Segmentation

Image Semantic Segmentation Fusion of Edge Detection and AFF Attention Mechanism

Adaptive fusion with multi-scale features for interactive image segmentation

EfficientFusion: simple and efficient learning with pixel-level fusion for semantic segmentation

Feature Fusion Network Based on Hybrid Attention for Semantic Segmentation

SIFusion: Lightweight infrared and visible image fusion based on semantic injection

Context and Boundary Guided Multi-Scale Feature Fusion Network for Semantic Segmentation

Real-Time Semantic Segmentation via Multiply Spatial Fusion Network