Depth Estimation Algorithm Based on Transformer-Encoder and Feature Fusion

Linhan Xia, Junbang Liu, Tong Wu
2024-03-03
Abstract: This research presents a novel depth estimation algorithm based on a Transformer-encoder architecture, tailored for the NYU Depth and KITTI datasets. It adopts the Transformer model, initially renowned for its success in natural language processing, to capture intricate spatial relationships in visual data for depth estimation tasks. A significant innovation of the research is a composite loss function that combines the Structural Similarity Index Measure (SSIM) with Mean Squared Error (MSE). This combined loss is designed to preserve the structural integrity of the predicted depth maps relative to the original images (via SSIM) while minimizing pixel-wise estimation errors (via MSE). The approach addresses the over-smoothing often seen with MSE-based losses and enhances the model's ability to predict depth maps that are not only accurate but also structurally coherent with the input images. Through rigorous training and evaluation using the NYU Depth Dataset, the model demonstrates superior performance, marking a significant advancement in single-image depth estimation, particularly in complex indoor and traffic environments.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to solve the problem of monocular image depth estimation. Specifically, the authors propose a depth estimation algorithm based on a Transformer encoder architecture and feature fusion to improve the accuracy of depth estimation in complex indoor and traffic environments. Traditional convolutional methods may struggle with complex scenes, so the Transformer model, which has proven successful in natural language processing, is applied to visual data to capture complex spatial relationships.

### Main contributions of the paper:

1. **Innovative Transformer encoder architecture**: Use the Transformer model to capture complex spatial relationships in images and improve the accuracy of depth estimation.
2. **Composite loss function**: A composite loss function that combines the Structural Similarity Index Measure (SSIM) and the Mean Squared Error (MSE) to ensure that the predicted depth map is structurally consistent with the original image while minimizing pixel-level errors.
3. **Feature fusion technology**: Improve the accuracy and efficiency of depth estimation by fusing feature matrices in the frequency domain and the spatial domain.

### Specific problems:

- **Over-smoothing problem**: The traditional MSE loss function is prone to over-smoothing, which affects the structural integrity of the depth map.
- **Accuracy in complex scenes**: In complex indoor and traffic environments, traditional methods may not be able to provide high-precision depth estimation.

### Solutions:

- **Transformer encoder**: Utilize the long-range dependency modeling ability of the Transformer model to extract richer feature information.
- **Composite loss function**: Combine SSIM and MSE to balance structural integrity and pixel-level accuracy.
- **Feature fusion**: Combine features in the frequency domain and the spatial domain to improve the model's adaptability to complex scenes.

Through these innovations, the paper reports superior performance on the NYU Depth Dataset and the KITTI Dataset, especially in complex indoor environments.
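
To make the composite loss idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. Several things here are assumptions that the summary does not specify: the weighting parameter `alpha`, the function names `global_ssim` and `composite_loss`, the choice to compare the predicted depth map against a reference depth map, and the use of a single whole-image SSIM window (standard SSIM averages the same quantity over local sliding windows). The constants `0.01` and `0.03` are the conventional SSIM stabilizers.

```python
import numpy as np


def global_ssim(pred, target, data_range=1.0):
    """SSIM computed over the whole image as one window.

    Simplification (assumption): standard SSIM averages this quantity
    over local sliding windows rather than using global statistics.
    """
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    numerator = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    denominator = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return numerator / denominator


def composite_loss(pred, target, alpha=0.5, data_range=1.0):
    """Weighted sum of a structural term (1 - SSIM) and a pixel-wise MSE term.

    `alpha` is a hypothetical weight; the paper's actual weighting is not
    given in this summary.
    """
    mse = np.mean((pred - target) ** 2)
    structural = 1.0 - global_ssim(pred, target, data_range)
    return alpha * structural + (1.0 - alpha) * mse


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    depth_true = rng.random((64, 64))                    # stand-in depth map in [0, 1]
    depth_pred = depth_true + 0.05 * rng.standard_normal((64, 64))
    print("composite loss:", composite_loss(depth_pred, depth_true))
```

With `alpha=0.0` the loss reduces to plain MSE, and with `alpha=1.0` it reduces to the structural term alone, which makes it easy to see how the two parts trade off pixel-level accuracy against structural consistency.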