Abstract: Because collecting large-scale, explicitly annotated videos is difficult, weakly supervised video object segmentation (WSVOS) using video tags has recently attracted much attention. Existing WSVOS approaches follow a general two-phase pipeline: a pseudo-mask generation phase and a refinement phase. To explore the intrinsic properties and correlations buried in the video frames, most of them focus on the latter phase, introducing optical flow as temporal information to provide more supervision. However, these optical flow-based methods are strongly affected by illumination changes and distortion, and they do not consider the discriminative capacity of multi-level deep features. In this article, with the goal of capturing more effective temporal information and designing a corresponding temporal information fusion strategy, we propose a unified WSVOS model that adopts a two-branch architecture with a multi-level cross-branch fusion strategy, named the dual-attention cross-branch fusion network (DACF-Net). Concretely, the two branches of DACF-Net, a temporal prediction subnetwork (TPN) and a spatial segmentation subnetwork (SSN), are used to extract temporal information and to generate predicted segmentation masks, respectively. To perform the cross-branch fusion between TPN and SSN, we propose a dual-attention fusion module that can be flexibly plugged into the SSN. We also propose a cross-frame coherence loss (CFCL) that yields smooth segmentation results by exploiting the coherence of the masks produced by TPN and SSN. Extensive experiments on two challenging datasets, DAVIS-2016 and YouTube-Objects, demonstrate the effectiveness of the proposed approach compared with state-of-the-art methods.

Fusion target attention mask generation network for video segmentation

Fast Real-Time Video Object Segmentation with a Tangled Memory Network

Deep Dual-Stream Network with Scale Context Selection Attention Module for Semantic Segmentation

FusionSeg: Learning to Combine Motion and Appearance for Fully Automatic Segmentation of Generic Objects in Videos

Optical Flow-Guided Mask Generation Network For Video Segmentation

Video object segmentation via couple streams and feature memory

Fast Video Object Segmentation Via Dynamic Targeting Network

Bi-directional Attention Feature Enhancement for Video Instance Segmentation

Full-duplex strategy for video object segmentation

MA-ResNet50: A General Encoder Network for Video Segmentation

Weakly Supervised Video Object Segmentation via Dual-attention Cross-branch Fusion

Dual Cross-Attention for Video Object Segmentation Via Uncertainty Refinement

Motion-Guided Spatial Time Attention for Video Object Segmentation

Semantic Image Segmentation with Improved Position Attention and Feature Fusion

Spatial attention-guided deformable fusion network for salient object detection

F2Net: Learning to Focus on the Foreground for Unsupervised Video Object Segmentation

Feature Fusion Network Based on Hybrid Attention for Semantic Segmentation

Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation

Spatiotemporal Graph Neural Network Based Mask Reconstruction for Video Object Segmentation

Video Object Segmentation via Structural Feature Reconfiguration

DARSegNet: A Real-Time Semantic Segmentation Method Based on Dual Attention Fusion Module and Encoder-Decoder Network