Abstract: Traditional shadow detectors often identify all shadow regions of static images or video sequences. This work presents Referring Video Shadow Detection (RVSD), an innovative task that rejuvenates the classic paradigm by segmenting particular shadows in videos based on descriptive natural language prompts. RVSD not only segments arbitrary shadow areas of interest based on descriptions (flexibility) but also lets users interact with visual content more directly and naturally through natural language prompts (interactivity), paving the way for applications ranging from advanced video editing to virtual reality experiences. To pioneer RVSD research, we curated a well-annotated RVSD dataset, which encompasses 86 videos and a rich set of 15,011 paired textual descriptions with corresponding shadows. To the best of our knowledge, this dataset is the first one for addressing RVSD. Based on this dataset, we propose a Referring Shadow-Track Memory Network (RSM-Net) for addressing the RVSD task. In our RSM-Net, we devise a Twin-Track Synergistic Memory (TSM) to store intra-clip memory features and hierarchical inter-clip memory features, and then pass these memory features into a memory read module to refine features of the current video frame for referring shadow detection. We also develop a Mixed-Prior Shadow Attention (MSA) to utilize physical priors to obtain a coarse shadow map for learning more visual features by weighting it with the input video frame. Experimental results show that our RSM-Net achieves state-of-the-art performance for RVSD with a notable Overall IOU increase of 4.4%. Our code and dataset are available at https://github.com/whq-xxh/RVSD.

What problem does this paper attempt to address?

The problem this paper addresses is the lack of flexibility and interactivity in traditional shadow detection methods when processing videos. Specifically, traditional shadow detection methods usually identify all shadow areas in static images or video sequences, but cannot segment specific shadows according to users' natural-language descriptions. This limits their application potential in the era of advanced multimedia interaction.

### Specific description of the problem

1. **Lack of flexibility**: Traditional methods cannot segment specific shadow areas according to users' needs.
2. **Lack of interactivity**: Users cannot interact with the system through natural language to specify the shadow parts they are interested in.
3. **Limited application scenarios**: Because they lack flexibility and interactivity, traditional methods are hard to apply to scenarios that require precise control of shadows, such as advanced video editing and virtual reality.

### New tasks proposed in the paper

To solve the above problems, the paper proposes a new, language-driven task named "Referring Video Shadow Detection (RVSD)". RVSD aims to interactively segment specific shadow areas in videos through natural-language descriptions. This new task has the following characteristics:

- **Flexibility**: It can segment any shadow area of interest according to natural-language descriptions.
- **Interactivity**: It allows users to interact with visual content more directly and naturally through natural-language prompts.
- **Broad application prospects**: From advanced video editing to virtual reality experiences, RVSD can improve the user experience of these applications.

### Solutions

To achieve the RVSD task, the authors make the following contributions:

1. **Construct the first RVSD dataset**: This dataset contains 86 videos and 15,011 pairs of text descriptions and their corresponding shadow annotations, covering a wide range of scenarios and dynamically changing shadows.
2. **Propose the RSM-Net model**: This model includes Twin-Track Synergistic Memory (TSM) and Mixed-Prior Shadow Attention (MSA) modules, which store and use temporal information and physical prior knowledge to accurately identify specific shadows.

### Main technical details

- **Twin-Track Synergistic Memory (TSM)**: It stores intra-clip memory features and hierarchical inter-clip memory features. A memory read module then uses these features to refine the features of the current video frame, helping the network follow shadow changes across the video.
- **Mixed-Prior Shadow Attention (MSA)**: It uses physical prior knowledge to generate a coarse shadow map, which is weighted with the input video frame to guide the network toward potential shadow areas.

Through these innovations, the paper demonstrates the superior performance of RSM-Net on the RVSD task, reporting an Overall IOU (Intersection over Union) increase of 4.4% over existing methods.

### Summary

By introducing the RVSD task, a dedicated dataset, and RSM-Net, this paper addresses the limitations of traditional shadow detection methods in flexibility and interactivity, providing new directions and tools for future research and practical applications.
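The Overall IOU figure quoted above can be made concrete with a small sketch. In referring segmentation work, Overall IoU is commonly computed as the total intersection divided by the total union accumulated over all evaluated samples, while Mean IoU averages the per-sample ratios. The text here does not spell out the paper's exact evaluation protocol, so the code below is an illustration of that common convention, not the authors' evaluation script. The mask representation (nested lists of 0/1), the function names, and the handling of empty masks are all assumptions made for this example.

```python
from typing import Sequence, Tuple

# Hypothetical representation: a binary mask as rows of 0/1 values.
Mask = Sequence[Sequence[int]]
Pair = Tuple[Mask, Mask]  # (predicted mask, ground-truth mask)


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count the pixels in the intersection and union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def overall_iou(pairs: Sequence[Pair]) -> float:
    """Total intersection over total union, accumulated across all samples."""
    total_inter = 0
    total_union = 0
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    # Assumption: report 0.0 when no mask contains any foreground pixel.
    return total_inter / total_union if total_union else 0.0


def mean_iou(pairs: Sequence[Pair]) -> float:
    """Average of per-sample IoU values."""
    scores = []
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        # Assumption: an empty prediction against an empty target scores 0.0.
        scores.append(inter / union if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Two toy (prediction, ground truth) pairs on 2x2 frames.
pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # intersection 1, union 2
    ([[1, 1], [1, 1]], [[1, 1], [1, 0]]),  # intersection 3, union 4
]

print(f"Overall IoU: {overall_iou(pairs):.4f}")  # Overall IoU: 0.6667
print(f"Mean IoU:    {mean_iou(pairs):.4f}")     # Mean IoU:    0.6250
```

The two numbers differ because Overall IoU weights each sample by its pixel count, so large shadows dominate, whereas Mean IoU treats every sample equally.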

Language-Driven Interactive Shadow Detection

Attention Res-Unet: an Efficient Shadow Detection Algorithm

Video Instance Shadow Detection Under the Sun and Sky

Triple-cooperative Video Shadow Detection

Learning Physical-Spatio-Temporal Features for Video Shadow Removal

Detect Any Shadow: Segment Anything for Video Shadow Detection

RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation

MRPFA-Net for Shadow Detection in Remote-Sensing Images

Exploring Better Target for Shadow Detection

Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images

Learning Referring Video Object Segmentation from Weak Annotation

CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation

Salient Object Detection in RGB-D Videos

Timeline and Boundary Guided Diffusion Network for Video Shadow Detection

Two-stage Visual Cues Enhancement Network for Referring Image Segmentation

SCOTCH and SODA: A Transformer Video Shadow Detection Framework

Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples

Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation

SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation

Temporally Consistent Referring Video Object Segmentation with Hybrid Memory

ReferEverything: Towards Segmenting Everything We Can Speak of in Videos