Abstract: Recent vision foundation models (VFMs) have demonstrated proficiency in various tasks but require supervised fine-tuning to perform the task of semantic segmentation effectively. Benchmarking their performance is essential for selecting current models and guiding future model developments for this task. The lack of a standardized benchmark complicates comparisons. Therefore, the primary objective of this paper is to study how VFMs should be benchmarked for semantic segmentation. To do so, various VFMs are fine-tuned under various settings, and the impact of individual settings on the performance ranking and training time is assessed. Based on the results, the recommendation is to fine-tune the ViT-B variants of VFMs with a 16x16 patch size and a linear decoder, as these settings are representative of using a larger model, more advanced decoder and smaller patch size, while reducing training time by more than 13 times. Using multiple datasets for training and evaluation is also recommended, as the performance ranking across datasets and domain shifts varies. Linear probing, a common practice for some VFMs, is not recommended, as it is not representative of end-to-end fine-tuning. The benchmarking setup recommended in this paper enables a performance analysis of VFMs for semantic segmentation. The findings of such an analysis reveal that pretraining with promptable segmentation is not beneficial, whereas masked image modeling (MIM) with abstract representations is crucial, even more important than the type of supervision used. The code for efficiently fine-tuning VFMs for semantic segmentation can be accessed through the project page at: https://tue-mps.github.io/benchmark-vfm-ss/

How to Benchmark Vision Foundation Models for Semantic Segmentation?

Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation

Discriminative Spatial-Semantic VOS Solution: 1st Place Solution for 6th LSVOS

1st Place Solution for 5th LSVOS Challenge: Referring Video Object Segmentation

1st Place Solution for ICCV 2023 OmniObject3D Challenge: Sparse-View Reconstruction

Exploring the Benefits of Vision Foundation Models for Unsupervised Domain Adaptation

Robustness Analysis on Foundational Segmentation Models

1st Place Solution for the UVO Challenge on Image-based Open-World Segmentation 2021

vFusedSeg3D: 3rd Place Solution for 2024 Waymo Open Dataset Challenge in Semantic Segmentation

1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation

Learning Spatial-Semantic Features for Robust Video Object Segmentation

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

UVO Challenge on Video-based Open-World Segmentation 2021: 1st Place Solution

First Place Solution to the ECCV 2024 ROAD++ Challenge @ ROAD++ Spatiotemporal Agent Detection 2024

Towards Fine-grained Large Object Segmentation 1st Place Solution to 3D AI Challenge 2020 -- Instance Segmentation Track

Benchmarking Robust Self-Supervised Learning Across Diverse Downstream Tasks

1st Place Solutions for OpenImage2019 -- Object Detection and Instance Segmentation

2nd Place Solution to ECCV 2020 VIPriors Object Detection Challenge

The 2nd Solution for LSVOS Challenge RVOS Track: Spatial-temporal Refinement for Consistent Semantic Segmentation