Abstract: Although fully convolutional networks (FCNs) have dominated dense prediction tasks (e.g., semantic segmentation, depth estimation and object detection) for decades, they are inherently limited in capturing long-range structured relationships because they are built from layers of local kernels. Recent Transformer-based models have proven extremely successful in computer vision tasks by capturing global representations, but they can degrade dense prediction results by over-smoothing regions that contain fine details (e.g., boundaries and small objects). To this end, we aim to provide an alternative perspective by rethinking local and global feature representation for the dense prediction task. Specifically, we deploy a Dual-Stream Convolution-Transformer architecture, called DSCT, that takes advantage of both convolution and the Transformer to learn a rich feature representation and is combined with a task decoder to provide a powerful dense prediction model. DSCT extracts high-resolution local feature representations from convolution layers and global feature representations from Transformer layers. With the local and global context modeled explicitly in every layer, the two streams can be combined with a decoder to perform semantic segmentation, monocular depth estimation or object detection. Extensive experiments show that DSCT achieves superior performance on these three tasks. For semantic segmentation, DSCT sets a new state of the art on the Cityscapes validation set (83.31% mIoU) with only 80,000 training iterations and achieves appealing performance (49.27% mIoU) on the ADE20K validation set, outperforming most of the alternatives. For monocular depth estimation, our model achieves 2.423 RMSE on the KITTI Eigen split, superior to most convolutional or Transformer counterparts. For object detection, without using FPN, we achieve 44.5% AP^b on the COCO dataset when using Faster R-CNN, which is higher than Conformer. (c) 2022 Elsevier Ltd. All rights reserved.
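The contrast the abstract draws between local and global features can be illustrated with a toy example. The sketch below is not the DSCT architecture: it assumes a 1D signal, a fixed 3-tap smoothing kernel for the local stream, single-head self-attention with identity projections for the global stream, and simple pairing of the two outputs in place of a learned decoder. All function names are hypothetical.

```python
import math


def local_stream(x, kernel=(0.25, 0.5, 0.25)):
    """Toy convolution stream: zero-padded 1D convolution.

    Each output depends only on a position and its immediate neighbours.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def global_stream(x):
    """Toy Transformer stream: scalar self-attention, identity projections.

    Each output is a softmax-weighted average over ALL positions.
    """
    out = []
    for q in x:
        scores = [q * k for k in x]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, x)) / total)
    return out


def fuse(x):
    """Pair local and global features per position (stand-in for a decoder)."""
    return list(zip(local_stream(x), global_stream(x)))


if __name__ == "__main__":
    signal = [0.0, 0.0, 1.0, 0.0, 0.0]
    for pos, (loc, glob) in enumerate(fuse(signal)):
        print(pos, round(loc, 3), round(glob, 3))
```

For the spike signal in the usage example, position 0 receives a local feature of zero because the spike lies outside its three-tap receptive field, while its global feature is non-zero because attention averages over every position. This is the complementary behaviour that motivates keeping both streams.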

Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction

Depthformer : Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion

Vision Transformers for Dense Prediction

Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNN

TransDSSL: Transformer Based Depth Estimation via Self-Supervised Learning

A Transformer-Based Image-Guided Depth-Completion Model with Dual-Attention Fusion Module

Rethinking Local and Global Feature Representation for Dense Prediction

Depth Estimation with Simplified Transformer

A transformer-CNN parallel network for image guided depth completion

Dense Transformer Networks

Vision Transformers: From Semantic Segmentation to Dense Prediction

GSSTU: Generative Spatial Self-Attention Transformer Unit for Enhanced Video Prediction

Lightweight Self-Supervised Monocular Depth Estimation Through CNN and Transformer Integration

Predictive Attention Transformer: Improving Transformer with Attention Map Prediction

Monocular Depth Estimation Algorithm Integrating Parallel Transformer and Multi-Scale Features

Demystify Transformers & Convolutions in Modern Image Deep Networks

Towards Comprehensive Monocular Depth Estimation: Multiple Heads are Better Than One

Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets

STDepthFormer: Predicting Spatio-temporal Depth from Video with a Self-supervised Transformer Model

Glance-and-Gaze Vision Transformer

TAMDepth: self-supervised monocular depth estimation with transformer and adapter modulation