Distillation and Supplementation of Features for Referring Image Segmentation

Zeyu Tan, Dahong Xu, Xi Li, Hong Liu
DOI: https://doi.org/10.1109/access.2024.3482108
IF: 3.9
2024-01-01
IEEE Access
Abstract: Referring Image Segmentation (RIS) aims to match the specific object instance described by a natural language expression to its region in an input image and to generate the corresponding pixel-level segmentation mask. Existing methods typically fuse linguistic and visual features into multi-modal features, which are then fed into a mask decoder to generate the segmentation mask. However, these methods ignore interfering noise in the multi-modal features, which adversely affects generation of the target mask. In addition, most current RIS models rely only on the residual structure inherited from the Transformer block; this limited information transfer makes it difficult for the model to form a hierarchical structure, which in turn degrades training. In this paper, we propose an RIS method called DSFRIS, which draws on sparse reconstruction and employs a novel mechanism for training the decoder. Specifically, we propose a feature distillation mechanism for the multi-modal feature fusion stage, which reduces noise in the fused multi-modal features, and a feature supplementation mechanism for the mask decoder training process, which enriches the feature information available during decoder training. Extensive experiments on three widely used RIS benchmark datasets demonstrate that the proposed method achieves state-of-the-art performance.