Towards Label-free Scene Understanding by Vision Foundation Models

Runnan Chen,Youquan Liu,Lingdong Kong,Nenglun Chen,Xinge Zhu,Yuexin Ma,Tongliang Liu,Wenping Wang

2023-10-30

Abstract: Vision foundation models such as Contrastive Vision-Language Pre-training (CLIP) and Segment Anything (SAM) have demonstrated impressive zero-shot performance on image classification and segmentation tasks. However, the incorporation of CLIP and SAM for label-free scene understanding has yet to be explored. In this paper, we investigate the potential of vision foundation models in enabling networks to comprehend 2D and 3D worlds without labelled data. The primary challenge lies in effectively supervising networks under extremely noisy pseudo labels, which are generated by CLIP and further exacerbated during the propagation from the 2D to the 3D domain. To tackle these challenges, we propose a novel Cross-modality Noisy Supervision (CNS) method that leverages the strengths of CLIP and SAM to supervise 2D and 3D networks simultaneously. In particular, we introduce a prediction consistency regularization to co-train 2D and 3D networks, then further impose the networks' latent space consistency using SAM's robust feature representation. Experiments conducted on diverse indoor and outdoor datasets demonstrate the superior performance of our method in understanding 2D and 3D open environments. Our 2D and 3D networks achieve label-free semantic segmentation with 28.4% and 33.5% mIoU on ScanNet, improving by 4.7% and 7.9%, respectively. For the nuImages and nuScenes datasets, the performance is 22.1% and 26.8%, with improvements of 3.5% and 6.0%, respectively. Code is available at https://github.com/runnanchen/Label-Free-Scene-Understanding.
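The results in the abstract are reported as mIoU (mean intersection-over-union). As a reminder of how that metric is usually defined, here is a minimal sketch in plain Python. It assumes one integer class label per pixel or point. The function name `mean_iou` is hypothetical, and this is not the authors' evaluation code.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes.

    pred, target: equal-length sequences of integer class ids.
    Classes that appear in neither pred nor target are skipped.
    """
    ious = []
    for c in range(num_classes):
        # Intersection: elements where both prediction and target are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Union: elements where either prediction or target is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

Benchmarks typically accumulate the intersections and unions over the whole evaluation set before dividing, rather than averaging per-scene scores. The sketch above handles a single flat list of labels.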

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses label-free scene understanding: training networks to understand 2D and 3D scenes without any labeled data. Existing 2D and 3D scene understanding methods have made significant progress, but they rely heavily on large amounts of labeled data and perform poorly on object categories that do not appear in the training data. Both issues limit their use in the real world, because high-quality labels are expensive and time-consuming to obtain and new objects may appear in real environments. Label-free scene understanding is therefore a highly valuable but relatively unexplored research topic. The paper proposes a novel Cross-modality Noisy Supervision (CNS) framework, which uses vision foundation models, namely Contrastive Vision-Language Pre-training (CLIP) and the Segment Anything Model (SAM), to train 2D and 3D networks simultaneously. The aim is to remove the dependence on labeled data, handle new object categories, and improve label-free semantic segmentation performance in open environments.
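To make the co-training idea concrete, here is a rough sketch of how a per-sample objective of this kind could be assembled for one paired pixel and point. The abstract names only the ingredients: noisy pseudo labels from CLIP, a prediction consistency term between the 2D and 3D networks, and a latent-space consistency term tied to SAM features. The specific loss forms below (cross-entropy, symmetric KL divergence, cosine distance), the weights `w_cons` and `w_feat`, and all function names are assumptions made for illustration, not the paper's actual formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy of one prediction against a (possibly noisy) pseudo label."""
    return -math.log(softmax(logits)[label])

def symmetric_kl(logits_a, logits_b, eps=1e-12):
    """Symmetric KL divergence between two predictions (assumed consistency term)."""
    p, q = softmax(logits_a), softmax(logits_b)
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)

def cosine_distance(u, v, eps=1e-12):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)

def cns_style_loss(logits_2d, logits_3d, pseudo_label,
                   feat_2d, feat_3d, feat_sam,
                   w_cons=1.0, w_feat=1.0):
    """Hypothetical combined loss for one paired pixel/point.

    logits_2d, logits_3d: class scores from the 2D and 3D networks.
    pseudo_label: class id derived from CLIP (assumed to be noisy).
    feat_2d, feat_3d: network features; feat_sam: SAM-derived target feature.
    """
    # Both networks are supervised by the same noisy pseudo label.
    supervised = (cross_entropy(logits_2d, pseudo_label)
                  + cross_entropy(logits_3d, pseudo_label))
    # The two modalities are encouraged to agree with each other.
    consistency = symmetric_kl(logits_2d, logits_3d)
    # Both feature spaces are pulled toward the SAM feature.
    latent = (cosine_distance(feat_2d, feat_sam)
              + cosine_distance(feat_3d, feat_sam))
    return supervised + w_cons * consistency + w_feat * latent
```

In this sketch, a pixel-point pair whose 2D and 3D predictions agree incurs a lower loss than one whose predictions disagree, which is the intuition behind a consistency regularizer. A real implementation would operate on batched tensors in a deep learning framework and would need the 2D-3D correspondence from camera calibration.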

Semi-Supervised Learning for Visual Bird's Eye View Semantic Segmentation

CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP

Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning

Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation

Open-Vocabulary SAM3D: Towards Training-free Open-Vocabulary 3D Scene Understanding

OV-NeRF: Open-vocabulary Neural Radiance Fields with Vision and Language Foundation Models for 3D Semantic Understanding

CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP

A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model

Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding

Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels

Model2Scene: Learning 3D Scene Representation via Contrastive Language-CAD Models Pre-training

TAMC: Textual Alignment and Masked Consistency for Open-Vocabulary 3D Scene Understanding

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation

Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance

OpenScene: 3D Scene Understanding with Open Vocabularies

Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation