Dense monocular depth estimation for stereoscopic vision based on pyramid transformer and multi-scale feature fusion

Zhongyi Xia, Tianzhao Wu, Zhuoyan Wang, Man Zhou, Boqi Wu, C. Y. Chan, Ling Bing Kong
DOI: https://doi.org/10.1038/s41598-024-57908-z
IF: 4.6
2024-03-27
Scientific Reports
Abstract: Stereoscopic display technology plays a significant role in industries, such as film, television and autonomous driving. The accuracy of depth estimation is crucial for achieving high-quality and realistic stereoscopic display effects. In addressing the inherent challenges of applying Transformers to depth estimation, the Stereoscopic Pyramid Transformer-Depth (SPT-Depth) is introduced. This method utilizes stepwise downsampling to acquire both shallow and deep semantic information, which are subsequently fused. The training process is divided into fine and coarse convergence stages, employing distinct training strategies and hyperparameters, resulting in a substantial reduction in both training and validation losses. In the training strategy, a shift and scale-invariant mean square error function is employed to compensate for the lack of translational invariance in the Transformers. Additionally, an edge-smoothing function is applied to reduce noise in the depth map, enhancing the model's robustness. The SPT-Depth achieves a global receptive field while effectively reducing time complexity. In comparison with the baseline method, with the New York University Depth V2 (NYU Depth V2) dataset, there is a 10% reduction in Absolute Relative Error (Abs Rel) and a 36% decrease in Root Mean Square Error (RMSE). When compared with the state-of-the-art methods, there is a 17% reduction in RMSE.
multidisciplinary sciences
What problem does this paper attempt to address?
The paper addresses the accuracy of monocular depth estimation for stereoscopic display technology. It proposes a method based on a pyramid Transformer and multi-scale feature fusion, the Stereoscopic Pyramid Transformer-Depth (SPT-Depth), to improve depth estimation accuracy and thereby the quality of stereoscopic display.

### Background and Problem Description

- **Background**: Stereoscopic display technology plays an important role in industries such as film, television, and autonomous driving. The accuracy of depth estimation is crucial for achieving high-quality and realistic stereoscopic display effects.
- **Limitations of Existing Methods**:
  - **Convolutional Neural Networks (CNNs)**: Although they perform well in depth estimation tasks, they tend to predict low-precision depth maps with unclear structural features.
  - **Transformer**: Although it can capture global information, it lacks translational invariance in depth estimation tasks, which leads to a loss of spatial information. It also has a large number of parameters and a high computational cost.

### Main Contributions of the Paper

- **SPT-Depth Model**: It combines the advantages of Transformers and CNNs, obtains shallow and deep semantic information through stepwise downsampling, and fuses the resulting features.
- **Training Strategy**: The training process is divided into fine and coarse convergence stages, each with its own training strategy and hyperparameters, which significantly reduces the training and validation losses.
- **Loss Function**: A Scale and Shift-Invariant Mean Squared Error (SSI-MSE) function is introduced to compensate for the lack of translational invariance in the Transformer.
- **Edge Smoothing Function**: It reduces noise in the depth map and enhances the robustness of the model.
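
The scale- and shift-invariant loss mentioned above can be illustrated with a minimal sketch. It assumes the common closed-form formulation in which the prediction is first aligned to the target by a least-squares scale and shift, and the mean squared error is then taken on the aligned values. The summary does not give the paper's exact definition, so the function names (`align_scale_shift`, `ssi_mse`, `total_loss`) and the default weight `lam=0.5` are hypothetical and used for illustration only.

```python
from statistics import fmean


def mse(pred, target):
    """Plain mean squared error between two equal-length sequences."""
    return fmean((p - y) ** 2 for p, y in zip(pred, target))


def align_scale_shift(pred, target):
    """Least-squares scale s and shift t minimizing sum((s*p + t - y)^2)."""
    mp, mt = fmean(pred), fmean(target)
    var = sum((p - mp) ** 2 for p in pred)
    if var == 0.0:
        # Constant prediction: only the shift can be fitted.
        return 0.0, mt
    cov = sum((p - mp) * (y - mt) for p, y in zip(pred, target))
    s = cov / var
    return s, mt - s * mp


def ssi_mse(pred, target):
    """MSE after aligning pred to target (assumed formulation, see lead-in)."""
    s, t = align_scale_shift(pred, target)
    return fmean((s * p + t - y) ** 2 for p, y in zip(pred, target))


def total_loss(pred, target, lam=0.5):
    """MSE plus a weighted scale/shift-invariant term; lam is a hypothetical value."""
    return mse(pred, target) + lam * ssi_mse(pred, target)


if __name__ == "__main__":
    pred = [1.0, 2.0, 3.0, 4.0]
    target = [2.0 * p + 5.0 for p in pred]  # same structure, different scale and shift
    print(mse(pred, target), ssi_mse(pred, target))
```

In this example the target differs from the prediction only by a scale and an offset, so the aligned term vanishes while the plain MSE does not. This is the property that makes such a term insensitive to global depth scale and shift.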
### Experimental Results

- **Performance Improvement**: On the NYU Depth V2 dataset, SPT-Depth reduces the Absolute Relative Error (Abs Rel) by 10% and the Root Mean Squared Error (RMSE) by 36% relative to the baseline method. Compared with state-of-the-art methods, the RMSE is reduced by 17%.

### Formula Presentation

- **Loss Function**:

  \[ \text{Loss}=\text{MSE}(y, \hat{y})+\lambda\cdot\text{SSI}(y, \hat{y}) \]

  where \(y\) is the predicted value, \(\hat{y}\) is the known target value of the depth map, \(\text{MSE}\) is the Mean Squared Error, \(\text{SSI}\) is the scale and shift-invariant function, and \(\lambda\) is the balancing parameter.

- **Multi-Head Attention (MHA) and Linear Spatial Reduction Attention (Linear SRA)**:

  \[ z_{0}=\{X_{\text{class}}; X_{1}^{\text{PE}}, X_{2}^{\text{PE}}, \ldots, X_{N}^{\text{PE}}\}+E_{\text{pos}}, \quad E\in\mathbb{R}^{(P^{2}\cdot C)\times D}, \quad E_{\text{pos}}\in\mathbb{R}^{(N + 1)\times D} \]

  \[ z'_{l}=\text{SRA}(\text{LN}(z_{l - 1}))+z_{l - 1}, \quad l = 1,\ldots,L \]

  \[ z_{l}=\text{MLP}(\text{LN}(z'_{l})) + z'_{l}, \quad l = 1,\ldots,L \]

  \[ y=\text{LN}(z_{L}^{0}) \]

### Summary

The paper addresses the accuracy of monocular depth estimation for stereoscopic display technology by proposing the SPT-Depth model, which combines Transformer and CNN components.
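
For reference, the two metrics quoted in the experimental results can be computed as in the sketch below. It uses the standard definitions, Abs Rel as the mean of |d − d*| / d* and RMSE as the square root of the mean squared difference, evaluated over pixels with a valid ground-truth depth. The function names and the convention that a ground-truth value of 0 marks an invalid pixel are assumptions for illustration, and the sample values are made up rather than taken from the paper.

```python
import math


def _valid_pairs(pred, target):
    """Keep only pixels whose ground-truth depth is positive (assumed validity rule)."""
    return [(p, y) for p, y in zip(pred, target) if y > 0]


def abs_rel(pred, target):
    """Absolute Relative Error: mean of |pred - target| / target over valid pixels."""
    pairs = _valid_pairs(pred, target)
    return sum(abs(p - y) / y for p, y in pairs) / len(pairs)


def rmse(pred, target):
    """Root Mean Squared Error over valid pixels."""
    pairs = _valid_pairs(pred, target)
    return math.sqrt(sum((p - y) ** 2 for p, y in pairs) / len(pairs))


if __name__ == "__main__":
    # Illustrative depths in metres; the last ground-truth pixel is invalid.
    pred = [1.0, 5.0, 3.0]
    target = [2.0, 4.0, 0.0]
    print(abs_rel(pred, target))  # (0.5 + 0.25) / 2 = 0.375
    print(rmse(pred, target))     # sqrt((1 + 1) / 2) = 1.0
```

Both metrics are lower-is-better, so the reported percentage reductions correspond to improvements over the compared methods.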