RTIA-Mono: Real-Time Lightweight Self-Supervised Monocular Depth Estimation with Global-Local Information Aggregation
Bowen Zhao, Hongdou He, Hang Xu, Peng Shi, Xiaobing Hao, Guoyan Huang
DOI: https://doi.org/10.1016/j.dsp.2024.104769
IF: 2.92
2024-01-01
Digital Signal Processing
Abstract: Self-supervised monocular depth estimation has attracted significant attention in computer vision, especially for applications such as autonomous driving and robotics. Recently, CNNs and Transformers have achieved tremendous success in this task. However, existing research primarily focuses on improving estimation accuracy, and the resulting increase in model complexity poses challenges for deployment on edge computing devices. Shallow CNNs help build lightweight networks but have limited receptive fields, which hinders the fusion of local geometric features with global semantic information. To address these issues, we propose RTIA-Mono, an efficient, real-time, lightweight self-supervised architecture for monocular depth estimation. First, we design a cross-stage feature fusion structure that promotes feature aggregation and fusion across stages. Second, in each stage, we propose a Global Local Information Aggregation (GLIA) module that integrates the advantages of CNNs and Transformers to aggregate local and global features. Additionally, we introduce a Directional Feature Enhancement (DFE) module that supplements spatial structure information to mitigate the spatial information loss caused by downsampling. With this design, the proposed approach outperforms state-of-the-art methods on the KITTI benchmark with the fewest parameters, and achieves a good balance among accuracy, complexity, and inference speed. Furthermore, RTIA-Mono demonstrates excellent generalization on other datasets.
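
The abstract does not give the internals of the GLIA module, so the following is only an illustrative sketch of the general idea of aggregating local and global features, not the authors' implementation. It assumes a depthwise 3x3 convolution as a stand-in for the CNN (local) branch, projection-free single-head self-attention as a stand-in for the Transformer (global) branch, and a residual sum as the fusion rule. The function names `local_branch`, `global_branch`, and `aggregate` are hypothetical.

```python
import numpy as np


def local_branch(x, kernel):
    """Depthwise 3x3 convolution with zero padding (assumed local branch).

    x has shape (C, H, W); kernel has shape (C, 3, 3).
    """
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c, h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            # Each channel is filtered only with its own 3x3 kernel.
            out += kernel[:, dy, dx, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out


def global_branch(x):
    """Single-head self-attention over all H*W positions (assumed global branch).

    No learned query/key/value projections are used, to keep the sketch short.
    """
    c, h, w = x.shape
    tokens = x.reshape(c, h * w).T               # (N, C), one token per pixel
    scores = tokens @ tokens.T / np.sqrt(c)      # (N, N) scaled dot products
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    mixed = weights @ tokens                     # every pixel attends to all pixels
    return mixed.T.reshape(c, h, w)


def aggregate(x, kernel):
    """Fuse both branches with a residual sum (an assumed fusion rule)."""
    return x + local_branch(x, kernel) + global_branch(x)
```

In this sketch the local branch only sees a 3x3 neighbourhood, while the attention branch lets every position draw on the whole feature map, which is the limitation of shallow CNNs that the abstract describes. Full attention costs O((HW)^2), so a real-time model would need a cheaper global operator than the one shown here.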