Match-Stereo-Videos: Bidirectional Alignment for Consistent Dynamic Stereo Matching

Junpeng Jing, Ye Mao, Krystian Mikolajczyk
2024-03-16
Abstract: Dynamic stereo matching is the task of estimating consistent disparities from stereo videos with dynamic objects. Recent learning-based methods prioritize optimal performance on a single stereo pair, resulting in temporal inconsistencies. Existing video methods apply per-frame matching and window-based cost aggregation across the time dimension, leading to low-frequency oscillations at the scale of the window size. Towards this challenge, we develop a bidirectional alignment mechanism for adjacent frames as a fundamental operation. We further propose a novel framework, BiDAStereo, that achieves consistent dynamic stereo matching. Unlike the existing methods, we model this task as local matching and global aggregation. Locally, we consider correlation in a triple-frame manner to pool information from adjacent frames and improve the temporal consistency. Globally, to exploit the entire sequence's consistency and extract dynamic scene cues for aggregation, we develop a motion-propagation recurrent unit. Extensive experiments demonstrate the performance of our method, showcasing improvements in prediction quality and achieving state-of-the-art results on various commonly used benchmarks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper aims to address the issue of consistent disparity estimation in dynamic stereo videos. Specifically:

1. **Problems with Existing Methods**:
   - Current deep learning-based methods mainly focus on optimal performance on single stereo image pairs, leading to a lack of consistency in the temporal dimension.
   - Existing video methods use frame-by-frame matching and apply sliding window cost aggregation in the temporal dimension, resulting in low-frequency oscillations.

2. **Research Objectives**:
   - Propose a new framework, BiDAStereo, to achieve consistent disparity estimation in dynamic stereo videos through a bidirectional alignment mechanism.
   - This method combines local matching (three-frame correlation layer) and global aggregation (motion propagation recurrent unit) to fully utilize information from the entire sequence, thereby improving disparity estimation accuracy in dynamic scenes.

3. **Main Contributions**:
   - Developed a bidirectional alignment mechanism to enforce temporal consistency in dynamic stereo vision.
   - Proposed a three-frame correlation layer to align adjacent frames and construct cost volumes, extracting local temporal receptive field information.
   - Introduced a new motion propagation recurrent unit to leverage global temporal information in dynamic scenes.
   - Achieved state-of-the-art performance in various benchmark tests.
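
To make the local matching step more concrete, the following is a minimal NumPy sketch of one plausible reading of a three-frame correlation layer. It is not the authors' implementation. Right-view features from frames t-1 and t+1 are warped to frame t with a given flow field, and each aligned feature map is then correlated with the left-view features of frame t along the horizontal (disparity) axis. The function names (`warp_to_reference`, `correlation_volume`, `triple_frame_correlation`), the nearest-neighbour warping, the use of precomputed flow, and the mean pooling in the toy check are all assumptions made for illustration.

```python
import numpy as np


def warp_to_reference(feat, flow):
    """Backward-warp feat (C, H, W) using flow (2, H, W) holding (dx, dy).

    Nearest-neighbour sampling keeps the sketch short (an assumption);
    a learned model would more likely use bilinear sampling.
    """
    _, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[1]).astype(int), 0, h - 1)
    return feat[:, src_y, src_x]


def correlation_volume(left, right, max_disp):
    """Cost volume (D, H, W): dot product of left(x) with right(x - d)."""
    _, h, w = left.shape
    volume = np.zeros((max_disp + 1, h, w), dtype=np.result_type(left, right))
    for d in range(max_disp + 1):
        # Columns x < d have no counterpart in the right view and stay 0.
        volume[d, :, d:] = (left[:, :, d:] * right[:, :, : w - d]).sum(axis=0)
    return volume


def triple_frame_correlation(left_t, right_prev, right_t, right_next,
                             flow_t_to_prev, flow_t_to_next, max_disp):
    """Stack cost volumes of left_t against right features of t-1, t, t+1.

    The adjacent right features are first aligned to frame t, so the
    result has shape (3, D, H, W). Hypothetical helper, not the paper's API.
    """
    aligned_prev = warp_to_reference(right_prev, flow_t_to_prev)
    aligned_next = warp_to_reference(right_next, flow_t_to_next)
    return np.stack([
        correlation_volume(left_t, r, max_disp)
        for r in (aligned_prev, right_t, aligned_next)
    ])


# Toy check: unit-norm random features, a constant true disparity of 2,
# and a scene that moves one pixel to the right per frame.
rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 4, 12))
left_t = feat / np.linalg.norm(feat, axis=0, keepdims=True)
true_disp = 2
right_t = np.roll(left_t, -true_disp, axis=2)   # right(x) = left(x + d)
right_prev = np.roll(right_t, -1, axis=2)       # content one pixel to the left
right_next = np.roll(right_t, 1, axis=2)        # content one pixel to the right
flow_prev = np.zeros((2, 4, 12))
flow_prev[0] = -1.0                             # flow from frame t to t-1
flow_next = np.zeros((2, 4, 12))
flow_next[0] = 1.0                              # flow from frame t to t+1

volumes = triple_frame_correlation(left_t, right_prev, right_t, right_next,
                                   flow_prev, flow_next, max_disp=4)
disparity = volumes.mean(axis=0).argmax(axis=0)  # mean pooling is illustrative
```

In this toy setup, the three aligned cost volumes agree on the matching disparity away from the left image border, which is the intuition behind pooling information from adjacent frames. The global aggregation stage (the motion propagation recurrent unit) is not sketched here because the summary does not describe it in enough detail.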