MMAIndoor: Patched MLP and Multi-dimensional Cross Attention Based Self-supervised Indoor Depth Estimation

Chen Lv, Chenggong Han, Tianshu Song, He Jiang, Qiqi Kou, Jiansheng Qian, Deqiang Cheng
DOI: https://doi.org/10.1016/j.neucom.2024.127972
IF: 6
2024-01-01
Neurocomputing
Abstract: Depth estimation can provide auxiliary information for scene perception. Indoor environments generally contain extensive textureless surfaces, such as walls and ceilings, that share similar scene and semantic content. The overly consistent features of local textureless areas fail to reflect changes in depth, which degrades the performance of existing depth estimation methods. In response to this challenge, we propose a dedicated indoor depth estimation method, named MMAIndoor, which provides global semantic guidance and shape priors for depth estimation in local textureless regions. The depth estimation network is designed for efficiency and comprises an initial convolutional stage and a latent patched multi-layer perceptron (Pat-MLP) stage. The novel Pat-MLP block uses MLP partitioning to globally model the depth-local information from the convolutional stage, and it incorporates axial shift operations to extract local information from various spatial locations, suppressing the smoothing effect of the MLP and enabling more precise estimation of sharp depth changes and small indoor structures. Further, we build a multi-dimensional cross attention (MCA) module to address the weak correlation between current residual connections and the global context. The MCA module captures global dependencies across multiple dimensions by sequentially executing cross attention along the channel and spatial dimensions, effectively mitigating semantic gaps in residual connections. Extensive experimental results demonstrate the state-of-the-art performance of MMAIndoor on benchmark datasets including NYUv2, ScanNet, and InteriorNet.
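The axial shift operation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the common form of axial shift used in shift-based MLP designs, where the channels are split into groups and each group is shifted by a different offset along one spatial axis with zero padding, so that a following channel-mixing layer sees values from neighbouring locations. The function names `shift_channel` and `axial_shift` and the parameter `shift_size` are hypothetical and chosen only for illustration.

```python
def shift_channel(channel, offset, axis):
    """Shift one H x W channel by `offset` cells along `axis`
    (0 = height, 1 = width), filling vacated cells with zeros."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            src_i = i - offset if axis == 0 else i
            src_j = j - offset if axis == 1 else j
            if 0 <= src_i < h and 0 <= src_j < w:
                out[i][j] = channel[src_i][src_j]
    return out


def axial_shift(features, shift_size=3, axis=1):
    """Apply an axial shift to `features`, a list of C channels (each H x W).

    Assumption: channels are split into `shift_size` contiguous groups and
    group g is shifted by g - shift_size // 2, so the offsets are centred
    on zero (e.g. -1, 0, +1 for shift_size=3).
    """
    num_channels = len(features)
    group_size = -(-num_channels // shift_size)  # ceiling division
    half = shift_size // 2
    shifted = []
    for c, channel in enumerate(features):
        offset = c // group_size - half
        shifted.append(shift_channel(channel, offset, axis))
    return shifted


# Three identical 1 x 3 channels, shifted by -1, 0 and +1 along the width.
features = [[[1.0, 2.0, 3.0]] for _ in range(3)]
shifted = axial_shift(features, shift_size=3, axis=1)
```

After the shift, the three channels at a given pixel hold the values of its right neighbour, itself, and its left neighbour, so a per-pixel MLP over channels mixes information from adjacent positions. A real network would do this on tensors and along both axes; the nested lists here only keep the sketch dependency-free.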