Abstract: Stereoscopic display technology plays a significant role in industries such as film, television and autonomous driving. The accuracy of depth estimation is crucial for achieving high-quality and realistic stereoscopic display effects. To address the inherent challenges of applying Transformers to depth estimation, the Stereoscopic Pyramid Transformer-Depth (SPT-Depth) is introduced. This method uses stepwise downsampling to acquire both shallow and deep semantic information, which are subsequently fused. The training process is divided into fine and coarse convergence stages that employ distinct training strategies and hyperparameters, resulting in a substantial reduction in both training and validation losses. In the training strategy, a shift- and scale-invariant mean square error function is employed to compensate for the lack of translational invariance in Transformers. Additionally, an edge-smoothing function is applied to reduce noise in the depth map, enhancing the model's robustness. SPT-Depth achieves a global receptive field while effectively reducing time complexity. Compared with the baseline method on the New York University Depth V2 (NYU Depth V2) dataset, there is a 10% reduction in Absolute Relative Error (Abs Rel) and a 36% decrease in Root Mean Square Error (RMSE). Compared with state-of-the-art methods, there is a 17% reduction in RMSE.
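
The two metrics quoted in the abstract can be illustrated with their standard definitions: Abs Rel is the mean of |prediction - target| / target, and RMSE is the square root of the mean squared difference. The sketch below is only an illustration on flat lists of depth values; the function names and the rule of skipping pixels with a non-positive target depth are assumptions, not details taken from the paper.

```python
import math


def _valid_pairs(pred, target):
    # Assumption: pixels with a non-positive target depth are treated as invalid.
    return [(p, t) for p, t in zip(pred, target) if t > 0]


def abs_rel(pred, target):
    """Absolute Relative Error: mean of |pred - target| / target."""
    pairs = _valid_pairs(pred, target)
    return sum(abs(p - t) / t for p, t in pairs) / len(pairs)


def rmse(pred, target):
    """Root Mean Square Error over the valid pixels."""
    pairs = _valid_pairs(pred, target)
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))


if __name__ == "__main__":
    predicted = [1.0, 2.0, 4.0]   # hypothetical predicted depths
    measured = [1.0, 2.5, 5.0]    # hypothetical ground-truth depths
    print("Abs Rel:", abs_rel(predicted, measured))
    print("RMSE:", rmse(predicted, measured))
```

Both metrics are lower-is-better, which is why the abstract reports its improvements as reductions.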

What problem does this paper attempt to address?

The paper aims to improve the accuracy of monocular depth estimation for stereoscopic display technology. Specifically, it proposes the Stereoscopic Pyramid Transformer-Depth (SPT-Depth), a method based on a Pyramid Transformer and multi-scale feature fusion, to improve the accuracy of depth estimation and thus enhance the effect of stereoscopic display.

### Background and Problem Description

- **Background**: Stereoscopic display technology plays an important role in industries such as movies, television, and autonomous driving. The accuracy of depth estimation is crucial for achieving high-quality and realistic stereoscopic display effects.
- **Limitations of Existing Methods**:
  - **Convolutional Neural Networks (CNNs)**: Although they perform well in depth estimation tasks, they suffer from low-precision depth map prediction and unclear structural features.
  - **Transformer**: Although it can capture global information, it lacks translational invariance in depth estimation tasks, which results in the loss of spatial information. It also has a large number of parameters and a high computational cost.

### Main Contributions of the Paper

- **SPT-Depth Model**: It combines the advantages of Transformers and CNNs, obtains shallow-layer and deep-layer semantic information through step-by-step downsampling, and fuses the resulting features.
- **Training Strategy**: The training process is divided into two stages, fine convergence and coarse convergence, each with its own training strategy and hyperparameters, which significantly reduces the training and validation losses.
- **Loss Function**: The Scale and Shift-Invariant Mean Squared Error (SSI-MSE) function is introduced to compensate for the lack of translational invariance in the Transformer.
- **Edge Smoothing Function**: It is used to reduce the noise in the depth map and enhance the robustness of the model.
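
As a rough illustration of what a scale and shift-invariant MSE can look like, the sketch below uses one common formulation: fit a scale and a shift to the prediction by least squares, then take the MSE of the aligned prediction. This summary does not give the paper's exact SSI-MSE definition, so the closed-form alignment and the name `ssi_mse` are assumptions.

```python
def ssi_mse(pred, target):
    """MSE after aligning pred to target with a least-squares scale and shift.

    Assumed formulation: choose scale s and shift t that minimise
    sum((s * pred + t - target) ** 2), then report the remaining MSE.
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    # Closed-form solution of the 1D linear regression; a constant
    # prediction (zero variance) falls back to a pure shift.
    scale = cov / var_p if var_p > 0 else 0.0
    shift = mean_t - scale * mean_p
    return sum((scale * p + shift - t) ** 2 for p, t in zip(pred, target)) / n


if __name__ == "__main__":
    pred = [1.0, 2.0, 3.0, 4.0]       # hypothetical predicted depths
    target = [1.0, 2.5, 2.0, 4.0]     # hypothetical ground-truth depths
    rescaled = [5.0 * p - 1.0 for p in pred]
    # The loss is unchanged when the prediction is rescaled and shifted.
    print(ssi_mse(pred, target), ssi_mse(rescaled, target))
```

Because the alignment absorbs any global scale and offset, the loss only penalises errors in the relative depth structure.
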
### Experimental Results

- **Performance Improvement**: Compared with the baseline method on the NYU Depth V2 dataset, SPT-Depth reduces the Absolute Relative Error (Abs Rel) by 10% and the Root Mean Squared Error (RMSE) by 36%. Compared with the state-of-the-art method, the RMSE is reduced by 17%.

### Formula Presentation

- **Loss Function**:

  \[ \text{Loss}=\text{MSE}(y, \hat{y})+\lambda\cdot\text{SSI}(y, \hat{y}) \]

  where \(y\) is the predicted value, \(\hat{y}\) is the known target value of the depth map, \(\text{MSE}\) is the Mean Squared Error, \(\text{SSI}\) is the scale and shift-invariant function, and \(\lambda\) is the balancing parameter.

- **Multi-Head Attention (MHA) and Linear Spatial Reduction Attention (Linear SRA)**:

  \[ z_{0}=\{X_{\text{class}}; X_{1}^{\text{PE}}, X_{2}^{\text{PE}}, \ldots, X_{N}^{\text{PE}}\}+E_{\text{pos}}, \quad E\in\mathbb{R}^{(P^{2}\cdot C)\times D}, \quad E_{\text{pos}}\in\mathbb{R}^{(N + 1)\times D} \]

  \[ z'_{l}=\text{SRA}(\text{LN}(z_{l - 1}))+z_{l - 1}, \quad l = 1,\ldots,L \]

  \[ z_{l}=\text{MLP}(\text{LN}(z'_{l})) + z'_{l}, \quad l = 1,\ldots,L \]

  \[ y=\text{LN}(z_{L}^{0}) \]

### Summary

This paper addresses the accuracy of monocular depth estimation in stereoscopic display technology by proposing the SPT-Depth model, which combines Transformers and CNNs.
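
The SRA term in the attention equations above can be illustrated with a minimal sketch. It assumes that spatial reduction means average-pooling the token grid before it is used as keys and values, so every query attends to a much shorter sequence. The identity projections, the single head, the window-based pooling, and all function names are simplifications chosen for illustration, not the paper's implementation.

```python
import math


def avg_pool_tokens(tokens, height, width, pool):
    """Average-pool a row-major (height * width) token grid in pool x pool windows."""
    dim = len(tokens[0])
    pooled = []
    for top in range(0, height, pool):
        for left in range(0, width, pool):
            window = [tokens[r * width + c]
                      for r in range(top, min(top + pool, height))
                      for c in range(left, min(left + pool, width))]
            pooled.append([sum(tok[d] for tok in window) / len(window)
                           for d in range(dim)])
    return pooled


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_reduction_attention(tokens, height, width, pool):
    """Single-head attention whose keys and values are the pooled tokens.

    Assumption: query/key/value projections are the identity, to keep the
    sketch short. A real layer would use learned linear projections.
    """
    reduced = avg_pool_tokens(tokens, height, width, pool)
    dim = len(tokens[0])
    output = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in reduced]
        weights = softmax(scores)
        output.append([sum(w * value[d] for w, value in zip(weights, reduced))
                       for d in range(dim)])
    return output


if __name__ == "__main__":
    grid = [[float(i), 1.0] for i in range(16)]  # hypothetical 4 x 4 grid, 2 channels
    out = spatial_reduction_attention(grid, height=4, width=4, pool=2)
    print(len(out), "query tokens attend to", len(avg_pool_tokens(grid, 4, 4, 2)), "keys")
```

With a 4 x 4 grid and 2 x 2 pooling, each of the 16 queries is compared with 4 keys instead of 16, which is the source of the reduced attention cost.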

Dense monocular depth estimation for stereoscopic vision based on pyramid transformer and multi-scale feature fusion

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Monocular Depth Estimation Based on Multi-Scale Graph Convolution Networks

DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation

Towards Comprehensive Monocular Depth Estimation: Multiple Heads are Better Than One

Monocular Depth Estimation Based on Dilated Convolutions and Feature Fusion

Robust Depth Estimation Based on Parallax Attention for Aerial Scene Perception

Depth Estimation from Monocular Images Using Dilated Convolution and Uncertainty Learning

Boosting Monocular Depth Estimation with Sparse Guided Points

Depthformer : Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion

A Contour-Aware Monocular Depth Estimation Network using Swin Transformer and Cascaded Multi-scale Fusion

PCTDepth: Exploiting Parallel CNNs and Transformer via Dual Attention for Monocular Depth Estimation

Monocular Depth Estimation Algorithm Integrating Parallel Transformer and Multi-Scale Features

AMENet is a monocular depth estimation network designed for automatic stereoscopic display

An Extremely Effective Spatial Pyramid and Pixel Shuffle Upsampling Decoder for Multiscale Monocular Depth Estimation

Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers

Depth Estimation Algorithm Based on Transformer-Encoder and Feature Fusion

DELTAS: Depth Estimation by Learning Triangulation And densification of Sparse points

Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance

Edge-Assisted Epipolar Transformer for Industrial Scene Reconstruction

Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks