DCMFNet: Deep Cross-Modal Fusion Network for Referring Image Segmentation with Iterative Gated Fusion

Zhen Huang, Mingcheng Xue, Yu Liu, Kaiping Xu, Jiangquan Li, Chenyang Yu
DOI: https://doi.org/10.1145/3670947.3670956
2024-01-01
Abstract: Cross-modal fusion aims to establish a consistent correspondence between arbitrary modalities. Due to the inherent differences between these modalities, accurately modeling their correspondence is a challenging task. Referring image segmentation (RIS) is a fundamental cross-modal task that intends to segment a desired object from an image based on a given natural language expression. In this paper, we propose an efficient algorithm called the Deep Cross-Modal Fusion Network (DCMFNet) to address this challenge. The proposed algorithm leverages the contextual information from linguistic context to guide the modeling of the visual context, gradually highlighting the referent in the image. The network architecture employs an innovative fusion strategy known as Iterative Gated Fusion (IGF) to capture the consistency relationship between multimodal features. IGF iteratively adjusts the relative importance of features at each level based on high-level semantics, emphasizing the shared information while suppressing the irrelevant parts. Specifically, IGF consists of cascaded fusion units and gating units. The fusion units integrate high-level semantics with the features from the previous layer to enhance the representation. The gating units perceive the discrepancy between the enhanced features and the original representation, and selectively weight and integrate the important features for further refinement. Through multi-layer iterative optimization, IGF gradually establishes a fine-grained correspondence between arbitrary modalities. Extensive experimental results on the Referring Image Segmentation task demonstrate the effectiveness and utility of the proposed method.
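The cascade of fusion and gating units described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact operations, so the elementwise modulation in `fusion_unit`, the sigmoid gate in `gating_unit`, the use of plain feature vectors instead of image feature maps, and all function and parameter names are assumptions chosen only to show the overall flow.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fusion_unit(semantic, previous):
    """Enhance the previous layer's features with high-level semantics.

    Assumption: elementwise modulation with a residual connection.
    """
    return [p + s * p for s, p in zip(semantic, previous)]


def gating_unit(enhanced, original):
    """Blend enhanced and original features based on their discrepancy.

    Assumption: the gate is a sigmoid of the elementwise difference, and the
    output is a convex combination of the two inputs.
    """
    refined = []
    for e, o in zip(enhanced, original):
        gate = sigmoid(e - o)
        refined.append(gate * e + (1.0 - gate) * o)
    return refined


def iterative_gated_fusion(semantic, features, num_layers=3):
    """Apply fusion followed by gating for a fixed number of layers."""
    state = list(features)
    for _ in range(num_layers):
        enhanced = fusion_unit(semantic, state)
        state = gating_unit(enhanced, state)
    return state


if __name__ == "__main__":
    # Hypothetical inputs: a 3-dimensional language vector and visual feature.
    language = [1.0, 0.0, -1.0]
    visual = [0.5, 0.5, 0.5]
    print(iterative_gated_fusion(language, visual))
```

In this toy setting, dimensions where the language vector is positive are amplified across layers, dimensions where it is negative are suppressed, and dimensions where it is zero pass through unchanged, which loosely mirrors the idea of emphasizing shared information while suppressing irrelevant parts.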