MDEConvFormer: estimating monocular depth as soft regression based on convolutional transformer

Wen Su, Ye He, Haifeng Zhang, Wenzhen Yang
DOI: https://doi.org/10.1007/s11042-024-18290-0
IF: 2.577
2024-01-27
Multimedia Tools and Applications
Abstract: Estimating depth from a single monocular image is a promising but challenging task in scene understanding. While Convolutional Neural Networks have been the dominant architectures, Vision Transformers have recently been gaining momentum in pixel-level classification tasks. However, as a substantial regression problem, depth estimation normally requires effective multi-scale context. The gap in transferring the above two architectures to regression has received little research attention. We focus on the fusion of multiple scales in both structures and on error compensation for the transformation from classification to regression. We are also concerned with the fact that the loss function in traditional regression tasks is usually built on the similarity of pixel predictions, while the high-level similarity constraints of pixels are usually ignored. This paper explores the feasibility of performing soft regression on the classification probability distribution generated by the proposed convolutional transformer. A well-designed deep learning model utilizes the multi-scale context of both convolutional networks and transformers. Pyramidally repeated context fusions guarantee that each representation receives multi-scale contextual information from parallel representations. We allow each depth class to be shifted adaptively, and the depth estimate is calculated as the expected value of the probability distribution. A homogeneous embedding loss is used for transferring task-specific appealing properties such as geometric information, semantic cues, and global context. Experiments confirm competitive results on popular indoor and outdoor datasets compared with recent state-of-the-art methods.
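The soft-regression step described in the abstract (depth as the expected value of a per-pixel class distribution, with adaptively shifted depth classes) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `soft_regression_depth`, the array shapes, and the representation of the adaptive shift as an additive per-pixel offset `bin_shifts` are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def soft_regression_depth(logits, bin_centers, bin_shifts=None):
    """Hypothetical sketch of soft regression over depth classes.

    logits:      (H, W, K) per-pixel class scores (assumed shape).
    bin_centers: (K,) representative depth of each class (assumed).
    bin_shifts:  optional (H, W, K) additive offsets standing in for the
                 "adaptive shift" of each depth class (an assumption).
    Returns an (H, W) depth map.
    """
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Shift each class center if offsets are given (broadcasts over H, W).
    centers = bin_centers if bin_shifts is None else bin_centers + bin_shifts

    # Depth is the expected value of the (shifted) centers under probs.
    return (probs * centers).sum(axis=-1)

# Usage: four depth classes centered at 1, 2, 3, 4 (arbitrary units).
centers = np.array([1.0, 2.0, 3.0, 4.0])
uniform_logits = np.zeros((2, 2, 4))
depth_uniform = soft_regression_depth(uniform_logits, centers)

peaked_logits = np.zeros((2, 2, 4))
peaked_logits[..., 2] = 50.0  # nearly all probability mass on the third class
depth_peaked = soft_regression_depth(peaked_logits, centers)

shifts = np.full((2, 2, 4), 0.5)  # every class shifted by +0.5
depth_shifted = soft_regression_depth(uniform_logits, centers, shifts)
```

With uniform probabilities the estimate is the mean of the centers (2.5 here), and a sharply peaked distribution recovers the center of the dominant class. Unlike a hard argmax over classes, the expectation varies continuously with the probabilities, which is what lets a classification head produce a regression output.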
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering