Abstract: Although fully convolutional networks (FCNs) have dominated dense prediction tasks (e.g., semantic segmentation, depth estimation and object detection) for decades, they are inherently limited in capturing long-range structured relationships because they are built from layers of local kernels. While recent Transformer-based models have proven extremely successful in computer vision tasks by capturing global representations, they can degrade dense prediction results by over-smoothing regions that contain fine details (e.g., boundaries and small objects). To this end, we aim to provide an alternative perspective by rethinking local and global feature representation for the dense prediction task. Specifically, we deploy a Dual-Stream Convolution-Transformer architecture, called DSCT, which takes advantage of both convolution and Transformer to learn a rich feature representation and is combined with a task decoder to provide a powerful dense prediction model. DSCT extracts high-resolution local feature representations from convolution layers and global feature representations from Transformer layers. With the local and global context modeled explicitly in every layer, the two streams can be combined with a decoder to perform semantic segmentation, monocular depth estimation or object detection. Extensive experiments show that DSCT achieves superior performance on these three tasks. For semantic segmentation, DSCT sets a new state of the art on the Cityscapes validation set (83.31% mIoU) with only 80,000 training iterations and achieves appealing performance (49.27% mIoU) on the ADE20K validation set, outperforming most of the alternatives. For monocular depth estimation, our model achieves 2.423 RMSE on the KITTI Eigen split, superior to most of the convolution or Transformer counterparts. For object detection, without using FPN, we achieve 44.5% APb on the COCO dataset with Faster R-CNN, which is higher than Conformer. (c) 2022 Elsevier Ltd. All rights reserved.
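For readers unfamiliar with the two scalar metrics quoted above, the following is a minimal sketch of how mIoU and RMSE are commonly defined. It is not the authors' evaluation code: the function names `mean_iou` and `rmse`, the `ignore_index=255` convention, and the rule that ground-truth depths of zero are invalid are all assumptions made for illustration. Benchmark protocols typically accumulate counts over the whole validation set rather than per image, and may apply further masking.

```python
import math


def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union over flat lists of class ids.

    Assumption: labels equal to `ignore_index` are skipped, and classes
    that never occur in `pred` or `target` are left out of the mean.
    """
    inter = [0] * num_classes  # true positives per class
    union = [0] * num_classes  # TP + FP + FN per class
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1  # false positive for the predicted class
            union[t] += 1  # false negative for the true class
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    if not ious:
        raise ValueError("no valid pixels to evaluate")
    return sum(ious) / len(ious)


def rmse(pred, target):
    """Root mean squared error over pixels with valid ground truth.

    Assumption: a ground-truth depth <= 0 marks a missing measurement.
    """
    valid = [(p, t) for p, t in zip(pred, target) if t > 0]
    if not valid:
        raise ValueError("no valid depth values to evaluate")
    return math.sqrt(sum((p - t) ** 2 for p, t in valid) / len(valid))


if __name__ == "__main__":
    # Toy segmentation: 6 pixels, 3 classes, last pixel ignored.
    print(mean_iou([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 255], num_classes=3))
    # Toy depth map: third pixel has no ground truth.
    print(rmse([10.0, 20.0, 5.0], [12.0, 20.0, 0.0]))
```

Under these definitions a higher mIoU is better (it is a fraction, reported as a percentage in the abstract), while a lower RMSE is better and carries the units of the depth values.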

DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency

Vision Transformers for Dense Prediction

Upsampling DINOv2 features for unsupervised vision tasks and weakly supervised materials segmentation

Mask DINO: Towards a Unified Transformer-Based Framework for Object Detection and Segmentation

Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning

DiT: Efficient Vision Transformers with Dynamic Token Routing

DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning

Vision Transformer with Super Token Sampling

UIFormer: A Unified Transformer-based Framework for Incremental Few-Shot Object Detection and Instance Segmentation

SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction

MST: Masked Self-Supervised Transformer for Visual Representation

Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation

DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video

Analyzing Local Representations of Self-supervised Vision Transformers

Change Dino: A Unified Transformer-Based Framework for Object-Level Change Detection and Segmentation in Remote Sensing Imagery

Dense Transformer Networks

A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence

DI-MaskDINO: A Joint Object Detection and Instance Segmentation Model

Rethinking Local and Global Feature Representation for Dense Prediction