Language-Driven Interactive Shadow Detection

Hongqiu Wang, Wei Wang, Haipeng Zhou, Huihui Xu, Shaozhi Wu, Lei Zhu
DOI: https://doi.org/10.1145/3664647.3681192
2024-08-16
Abstract: Traditional shadow detectors often identify all shadow regions of static images or video sequences. This work presents the Referring Video Shadow Detection (RVSD), which is an innovative task that rejuvenates the classic paradigm by facilitating the segmentation of particular shadows in videos based on descriptive natural language prompts. This novel RVSD not only achieves segmentation of arbitrary shadow areas of interest based on descriptions (flexibility) but also allows users to interact with visual content more directly and naturally by using natural language prompts (interactivity), paving the way for abundant applications ranging from advanced video editing to virtual reality experiences. To pioneer the RVSD research, we curated a well-annotated RVSD dataset, which encompasses 86 videos and a rich set of 15,011 paired textual descriptions with corresponding shadows. To the best of our knowledge, this dataset is the first one for addressing RVSD. Based on this dataset, we propose a Referring Shadow-Track Memory Network (RSM-Net) for addressing the RVSD task. In our RSM-Net, we devise a Twin-Track Synergistic Memory (TSM) to store intra-clip memory features and hierarchical inter-clip memory features, and then pass these memory features into a memory read module to refine features of the current video frame for referring shadow detection. We also develop a Mixed-Prior Shadow Attention (MSA) to utilize physical priors to obtain a coarse shadow map for learning more visual features by weighting it with the input video frame. Experimental results show that our RSM-Net achieves state-of-the-art performance for RVSD with a notable Overall IOU increase of 4.4%. Our code and dataset are available at https://github.com/whq-xxh/RVSD.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the lack of flexibility and interactivity in traditional video shadow detection. Traditional methods identify all shadow regions in static images or video sequences, but they cannot segment a specific shadow from a user's natural-language description. This limits their usefulness in interactive multimedia applications.

### Specific description of the problem

1. **Lack of flexibility**: Traditional methods cannot segment a particular shadow region according to the user's needs.
2. **Lack of interactivity**: Users cannot use natural language to tell the system which shadows they are interested in.
3. **Limited application scenarios**: Without flexibility and interactivity, traditional methods are hard to apply where precise control over shadows is required, such as advanced video editing and virtual reality.

### New task proposed in the paper

To address these problems, the paper proposes a new language-driven task, Referring Video Shadow Detection (RVSD). RVSD segments specific shadow regions in videos based on natural-language descriptions. The task has the following characteristics:

- **Flexibility**: It can segment any shadow region of interest according to a natural-language description.
- **Interactivity**: It lets users interact with visual content more directly and naturally through natural-language prompts.
- **Broad application prospects**: It opens the way for applications ranging from advanced video editing to virtual reality experiences.

### Solutions

To tackle the RVSD task, the authors make the following contributions:

1. **Construct the first RVSD dataset**: The dataset contains 86 videos and 15,011 text descriptions paired with their corresponding shadow annotations, covering a wide range of scenarios and dynamically changing shadows.
2. **Propose the RSM-Net model**: The Referring Shadow-Track Memory Network (RSM-Net) includes the Twin-Track Synergistic Memory (TSM) and Mixed-Prior Shadow Attention (MSA) modules, which store and exploit temporal information and physical prior knowledge to identify the referred shadow accurately.

### Main technical details

- **Twin-Track Synergistic Memory (TSM)**: It stores intra-clip memory features and hierarchical inter-clip memory features. A memory read module then uses these features to refine the features of the current video frame.
- **Mixed-Prior Shadow Attention (MSA)**: It uses physical priors to produce a coarse shadow map, which is weighted with the input video frame so that the network learns more visual features from likely shadow regions.

With these components, RSM-Net achieves state-of-the-art performance on the RVSD task, with an Overall IoU (Intersection over Union) increase of 4.4%.

### Summary

By introducing the RVSD task, a dedicated dataset, and RSM-Net, the paper addresses the limited flexibility and interactivity of traditional shadow detection methods and offers new directions and tools for future research and practical applications.
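The Overall IoU metric mentioned above is not defined in this summary. In referring-segmentation work it is commonly computed as the cumulative intersection divided by the cumulative union over all evaluated samples, unlike a per-sample mean IoU. The following is a minimal sketch under that assumption; the function names and the nested-list mask format are illustrative and are not taken from the RVSD code base.

```python
from typing import Sequence, Tuple

# Assumed mask format: rows of 0/1 values, one mask per frame/description pair.
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count intersection and union pixels of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def overall_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Cumulative intersection over cumulative union across all samples."""
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average of per-sample IoU values, shown only for comparison."""
    scores = []
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        scores.append(inter / union if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
    targets = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
    print("overall IoU:", overall_iou(preds, targets))
    print("mean IoU:", mean_iou(preds, targets))
```

Because the cumulative form pools pixels across samples, large shadows influence it more than small ones, so it can differ from the per-sample mean on the same predictions.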
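The MSA description above (a coarse shadow map derived from physical priors and then weighted with the input frame) can be illustrated with a toy example. The summary does not say which priors the authors use, so the sketch below assumes a single hypothetical darkness prior (darker pixels are more likely to be shadow) on a grayscale frame. It shows only the weighting idea and is not the RSM-Net implementation.

```python
from typing import List

# Assumed input: a grayscale frame with intensities in [0, 1].
Frame = List[List[float]]


def coarse_shadow_map(frame: Frame) -> Frame:
    """Hypothetical darkness prior: score each pixel by how dark it is
    relative to the frame's intensity range (1.0 = darkest pixel)."""
    flat = [value for row in frame for value in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A uniform frame carries no darkness cue, so return an all-zero map.
        return [[0.0 for _ in row] for row in frame]
    return [[(hi - value) / span for value in row] for row in frame]


def weight_frame(frame: Frame, shadow_map: Frame) -> Frame:
    """Weight the frame element-wise by the coarse shadow map."""
    return [
        [value * weight for value, weight in zip(frame_row, map_row)]
        for frame_row, map_row in zip(frame, shadow_map)
    ]


if __name__ == "__main__":
    frame = [[0.2, 0.8], [0.5, 0.2]]
    prior = coarse_shadow_map(frame)
    print("coarse map:", prior)
    print("weighted frame:", weight_frame(frame, prior))
```

In a real network the weighted frame would be fed to a learned feature extractor. Here it only demonstrates how a prior map can emphasize candidate shadow pixels and suppress bright ones.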