Abstract: While recent deep learning-based stereo-matching networks have shown outstanding advances, some challenges remain unsolved. First, most state-of-the-art stereo models employ 3D convolutions for 4D cost volume aggregation, which limits their deployment in resource-limited mobile environments because of heavy computation and memory consumption. Although some efficient networks exist, most of them are still too computationally expensive to run in real time on mobile computing devices. Second, most stereo networks supervise cost volumes only indirectly, through a disparity regression loss computed with the softargmax function. This causes problems in ambiguous regions, such as the boundaries of objects, because many unreasonable cost distributions can produce the same regressed disparity, which results in overfitting. A few works address this problem by generating an artificial cost distribution from the ground truth disparity value alone, which is insufficient to fully regularize the cost volume. To address these problems, we first propose an efficient multi-scale sequential feature fusion network (MSFFNet). Specifically, we connect multi-scale SFF modules in parallel with a cross-scale fusion function to generate a set of cost volumes at different scales. These cost volumes are then effectively combined using the proposed interlaced concatenation method. Second, we propose an adaptive cost-volume-filtering (ACVF) loss function that directly supervises our estimated cost volume. The proposed ACVF loss adds constraints directly to the cost volume using the probability distribution generated from the ground truth disparity map and the distribution estimated by a teacher network that achieves higher accuracy. Results of several experiments on representative stereo-matching datasets show that our proposed method is more efficient than previous methods.
Our network architecture uses fewer parameters and generates reasonable disparity maps faster than existing state-of-the-art stereo models. Concretely, our network achieves 1.01 EPE with a runtime of 42 ms, 2.92M parameters, and 97.96G FLOPs on the Scene Flow test set. Compared with PSMNet, our method is 89% faster and 7% more accurate with 45% fewer parameters.
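
The ambiguity that motivates direct cost-volume supervision can be illustrated with a small sketch. The code below is not the paper's ACVF loss. It assumes a single pixel with nine disparity candidates, uses a hypothetical Gaussian-shaped target centred on the ground truth (the helper `unimodal_target` and its `sigma` parameter are illustrative assumptions), and compares a disparity regression view with a distribution-level view of two different matching-score profiles.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of matching scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_argmax(scores):
    """Regressed disparity: expectation of the disparity index under softmax."""
    probs = softmax(scores)
    return sum(d * p for d, p in enumerate(probs))

def unimodal_target(gt_disparity, num_disparities, sigma=1.0):
    """Hypothetical Gaussian-shaped target distribution (an assumption,
    not the distribution used by the ACVF loss)."""
    scores = [-((d - gt_disparity) ** 2) / (2.0 * sigma ** 2)
              for d in range(num_disparities)]
    return softmax(scores)

def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))

# Two score profiles over 9 disparity candidates; the ground truth is 4.
unimodal = [0, 0, 0, 0, 10, 0, 0, 0, 0]   # one sharp peak at d = 4
bimodal = [0, 0, 10, 0, 0, 0, 10, 0, 0]   # peaks at d = 2 and d = 6

target = unimodal_target(gt_disparity=4, num_disparities=9)

# Both profiles regress to (approximately) the same disparity ...
reg_unimodal = soft_argmax(unimodal)
reg_bimodal = soft_argmax(bimodal)

# ... but a distribution-level loss separates them.
ce_unimodal = cross_entropy(target, softmax(unimodal))
ce_bimodal = cross_entropy(target, softmax(bimodal))

print(reg_unimodal, reg_bimodal)
print(ce_unimodal, ce_bimodal)
```

In this toy case a regression loss on the softargmax output cannot distinguish the two profiles, whereas a loss applied to the distribution itself penalizes the bimodal one more heavily. That is the general idea behind supervising the cost volume directly, under the assumptions stated above.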

CVE-Net: Cost Volume Enhanced Network Guided by Sparse Features for Stereo Matching

Stereo Matching Using Multi-Level Cost Volume and Multi-Scale Feature Constancy

CVCNet: Learning Cost Volume Compression for Efficient Stereo Matching

Ghost-Stereo: GhostNet-based Cost Volume Enhancement and Aggregation for Stereo Matching Networks

Multi-Scale Cost Volumes Cascade Network for Stereo Matching

SCV-Stereo: Learning Stereo Matching from a Sparse Cost Volume

Exploiting Semantic and Boundary Information for Stereo Matching

Edge supervision and multi-scale cost volume for stereo matching

Local Similarity Pattern and Cost Self-Reassembling for Deep Stereo Matching Networks

A Convolutional Attention Residual Network for Stereo Matching

Robust Stereo Matching Using Discriminative Multilevel Features and Multimodal Bifurcated Cost Volume Network

A Light-Weight Network with Multi-Scale Features Fusion and Color Guidance for Stereo Matching

Adaptive Cost Volume Representation for Unsupervised High-resolution Stereo Matching

A Light-Weight Stereo Matching Network Based on Multi-Scale Features Fusion and Robust Disparity Refinement

Cascaded Feature Interaction Network for Stereo Matching

End-to-End Learning of Multi-scale Convolutional Neural Network for Stereo Matching

Stereo Matching with Cost Volume based Sparse Disparity Propagation

Efficient Multi-Scale Stereo-Matching Network Using Adaptive Cost Volume Filtering

Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation

Bidirectional Stereo Matching Network With Double Cost Volumes

Stereo Matching Method for Remote Sensing Images Based on Attention and Scale Fusion