Timeline and Boundary Guided Diffusion Network for Video Shadow Detection

Haipeng Zhou, Honqiu Wang, Tian Ye, Zhaohu Xing, Jun Ma, Ping Li, Qiong Wang, Lei Zhu
DOI: https://doi.org/10.1145/3664647.3681236
2024-08-22
Abstract: Video Shadow Detection (VSD) aims to detect shadow masks in a frame sequence. Existing works suffer from inefficient temporal learning. Moreover, few works address the VSD problem by considering the characteristics of shadows (i.e., their boundaries). Motivated by this, we propose a Timeline and Boundary Guided Diffusion (TBGDiff) network for VSD that jointly takes into account past-future temporal guidance and boundary information. In detail, we design a Dual Scale Aggregation (DSA) module for better temporal understanding by rethinking the affinity of the long-term and short-term frames of the clipped video. Next, we introduce Shadow Boundary Aware Attention (SBAA) to utilize edge contexts for capturing the characteristics of shadows. Moreover, we are the first to introduce the Diffusion model for VSD, in which we explore a Space-Time Encoded Embedding (STEE) to inject temporal guidance for the Diffusion model to conduct shadow detection. Benefiting from these designs, our model captures not only temporal information but also shadow properties. Extensive experiments show that our approach outperforms the state-of-the-art methods, verifying the effectiveness of our components. We release the codes, weights, and results at https://github.com/haipengzhou856/TBGDiff.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses two key challenges in Video Shadow Detection (VSD):

1. **Insufficient utilization of temporal information**: Existing VSD methods learn temporal correspondences inefficiently and do not capture the temporal information in video frame sequences well. Most of them rely on additional cues, such as optical flow, to relate adjacent frames, but these cues lack semantic correspondence and tend to miss deformed regions, which are crucial for shadow detection.
2. **Ignoring shadow characteristics**: Few works design models around the characteristics of shadows, such as boundary information. Shadow boundaries are often highly uncertain, which makes accurate segmentation difficult. Some recent studies nevertheless show that boundary information can provide useful cues for shadow recognition.

To address these problems, the authors propose the Timeline and Boundary Guided Diffusion network (TBGDiff), which introduces a diffusion model and combines temporal and boundary information to detect shadows in video more effectively. The main contributions of TBGDiff are:

- **Applying the diffusion model to video shadow detection for the first time** and enhancing the temporal understanding of the diffusion model through three different temporal guidance methods.
- **Designing the Dual Scale Aggregation (DSA) module** to better aggregate temporal features, improving the understanding of short-term consistency and of regions that change over the long term.
- **Introducing the Shadow Boundary Aware Attention (SBAA) mechanism** to help the model capture shadow boundary information more accurately.
- **Developing the Space-Time Encoded Embedding (STEE)** to inject temporal guidance, enabling the diffusion model to better handle shadow detection in videos.

Through these innovations, TBGDiff captures temporal information and also models the characteristics of shadows. The authors report that it outperforms existing methods on multiple evaluation metrics.
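
To make the boundary cue concrete, the sketch below extracts a one-pixel boundary band from a binary shadow mask. It is only an illustration of the kind of edge signal a boundary-aware module could consume, not the authors' SBAA implementation. The function name `boundary_band`, the 4-neighbourhood rule, and the choice to ignore neighbours outside the image are all assumptions made for this example.

```python
from typing import List

Mask = List[List[int]]  # 1 = shadow pixel, 0 = non-shadow pixel


def boundary_band(mask: Mask) -> Mask:
    """Mark shadow pixels that touch a non-shadow pixel (4-neighbourhood).

    Hypothetical helper for illustration only. Neighbours that fall
    outside the image are ignored, so a shadow touching the image border
    is not treated as a boundary there.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    band = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x] != 1:
                continue  # only shadow pixels can belong to the band
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                inside = 0 <= ny < height and 0 <= nx < width
                if inside and mask[ny][nx] == 0:
                    band[y][x] = 1
                    break
    return band


# Usage: a 3x3 shadow blob inside a 5x5 frame.
demo_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
demo_band = boundary_band(demo_mask)
for row in demo_band:
    print(row)
```

In this toy frame only the centre pixel of the blob is interior, and the eight surrounding shadow pixels form the band. A learned model would typically predict or weight such a band rather than compute it from ground truth at inference time.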