Research on Monocular Depth Estimation Method Based on Multi-Level Attention and Feature Fusion

Zhongyu Wu, Hua Huang, Qishen Li, Penghui Chen
DOI: https://doi.org/10.1109/iaeac59436.2024.10503704
2024-01-01
Abstract: Monocular depth estimation is a fundamental task in computer vision and has drawn increasing attention. Recently, attention-based models and encoder-decoder architectures have brought substantial improvements to monocular depth estimation. Most previous methods, however, rely on repeated simple up-sampling operations during decoding, which may not fully exploit the features extracted by the encoder and leads to inaccurate predictions at object edges and in regions of maximum depth. We propose an attention-based feature fusion module that connects the encoder and decoder. We treat monocular depth estimation as a pixel-level optimization problem: the coarsest encoder feature initializes the pixel-level optimization, and the result is then refined to higher resolution by the proposed attentional feature fusion (AFF) module. We formulate the prediction problem as ordinal regression over the bin centers that discretize the continuous depth range. The distribution of bins adapts to each input image, and the bins are predicted at the coarsest level using global pooling and MLP layers. On the NYUV2 dataset, the proposed architecture improves on the original model by 2.5% and 1.1% in terms of Log10 error and absolute relative error, respectively.
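The adaptive-bin formulation in the abstract can be illustrated with a minimal sketch. The abstract does not say how the final depth is read out from the bins, so this example assumes the readout common to adaptive-bin methods: normalized bin widths are turned into bin centers over the depth range, and each pixel's depth is the probability-weighted sum of those centers. The function names, the depth range, and the toy inputs are hypothetical and are not taken from the paper; in the paper the width logits would come from global pooling and MLP layers rather than being written by hand.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bin_centers(width_logits, d_min=1e-3, d_max=10.0):
    """Turn per-image width logits into bin centers covering [d_min, d_max].

    Assumption: widths are normalized with a softmax so they sum to 1,
    then scaled to the depth range. The range values are placeholders.
    """
    widths = softmax(width_logits)
    centers = []
    left = d_min
    for w in widths:
        span = w * (d_max - d_min)
        centers.append(left + span / 2.0)  # midpoint of this bin
        left += span
    return centers


def pixel_depth(bin_logits, centers):
    """Assumed readout: expected depth under the pixel's bin distribution."""
    probs = softmax(bin_logits)
    return sum(p * c for p, c in zip(probs, centers))


# Toy example: four equal-width bins and a pixel with no bin preference.
centers = bin_centers([0.0, 0.0, 0.0, 0.0], d_min=0.0, d_max=10.0)
depth = pixel_depth([0.0, 0.0, 0.0, 0.0], centers)
```

Because the widths are predicted per image, a scene dominated by nearby surfaces can devote more, narrower bins to small depths, while a scene with a long corridor can spread bins toward the far end of the range.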