Abstract: The image annotation stage is a critical and often the most time-consuming part required for training and evaluating object detection and semantic segmentation models. Deployment of the existing models in novel environments often requires detecting novel semantic classes not present in the training data. Furthermore, indoor scenes contain significant viewpoint variations, which need to be handled properly by trained perception models. We propose to leverage the recent advancements in state-of-the-art models for bottom-up segmentation (SAM), object detection (Detic), and semantic segmentation (MaskFormer), all trained on large-scale datasets. We aim to develop a cost-effective labeling approach to obtain pseudo-labels for semantic segmentation and object instance detection in indoor environments, with the ultimate goal of facilitating the training of lightweight models for various downstream tasks. We also propose a multi-view labeling fusion stage, which considers the setting where multiple views of the scenes are available and can be used to identify and rectify single-view inconsistencies. We demonstrate the effectiveness of the proposed approach on the Active Vision dataset and the ADE20K dataset. We evaluate the quality of our labeling process by comparing it with human annotations. Also, we demonstrate the effectiveness of the obtained labels in downstream tasks such as object goal navigation and part discovery. In the context of object goal navigation, we depict enhanced performance using this fusion approach compared to a zero-shot baseline that utilizes large monolithic vision-language pre-trained models.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address several key issues in indoor scene image annotation:

1. **Annotation Bottleneck in Training and Evaluating Models**:
   - Image annotation is a crucial step in training and evaluating object detection and semantic segmentation models, but it is often very time-consuming. For example, annotating and validating the HM3D dataset requires over 14,200 hours of manual effort.
2. **Detection of New Categories in New Environments**:
   - When deploying existing models in new environments, it is often necessary to detect new semantic categories that did not appear in the training data. Significant viewpoint variations in indoor scenes pose higher demands on the trained perception models.
3. **Multi-View Consistency**:
   - There are various viewpoint changes in indoor scenes, and single-view annotations may produce inconsistencies. Therefore, a method is needed to identify and correct these single-view inconsistencies.

### Solution

To address the above issues, the authors propose a method that integrates existing state-of-the-art models (such as the bottom-up segmentation model SAM, object detection model Detic, and semantic segmentation model MaskFormer) to obtain pseudo-labels in indoor environments. Specifically:

1. **Single-View Annotation**:
   - Use Detic for foreground category detection, use MaskFormer for background category semantic segmentation, and combine SAM to generate high-quality segmentation masks.
2. **Multi-View Fusion**:
   - When multiple views of scene images are available, identify and correct single-view inconsistencies through a multi-view fusion stage to improve annotation quality.
3. **Downstream Task Evaluation**:
   - Evaluate the quality of the annotation results through experiments on the Active Vision dataset and ADE20K dataset, and demonstrate their effectiveness in downstream tasks such as object goal navigation and part discovery.

### Main Contributions

1. **Design of a Fusion Prediction Annotation Method**:
   - Integrate the prediction results of state-of-the-art semantic segmentation and object detection models to obtain semantic labels for class-agnostic image segmentation.
2. **Enhancement of Multi-View Annotation Consistency**:
   - Enhance single-view annotations through multi-view semantic consistency on the Active Vision dataset and compare them with manual annotation results.
3. **Validation of Effectiveness in Downstream Tasks**:
   - Demonstrate the effectiveness of the fusion segmentation results through object goal navigation and part discovery tasks, especially in zero-shot settings.

### Experimental Results

- **Annotation Quality Evaluation**:
  - Validate the superior performance of the proposed method in semantic segmentation and small object segmentation tasks by comparing it with manual annotations.
- **Downstream Task Performance**:
  - In the object goal navigation task, the proposed method outperforms zero-shot baseline methods based on large-scale multimodal vision-language pre-trained models.

In summary, this paper provides an efficient and high-quality indoor scene image annotation method by integrating existing state-of-the-art models, offering strong support for downstream tasks.
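
### Illustrative Sketch

The idea of giving semantic labels to class-agnostic masks, and then reconciling labels across views, can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say how labels are combined, so the sketch simply assumes a majority vote. The function names `label_masks` and `fuse_views`, the `ignore_label` convention, and the toy inputs are all hypothetical, and the calls to SAM, Detic, and MaskFormer are replaced by plain NumPy arrays.

```python
import numpy as np


def label_masks(masks, semantic_map, ignore_label=-1):
    """Give each class-agnostic mask the most frequent label found under it.

    masks: iterable of boolean arrays of shape (H, W), e.g. from a
        class-agnostic segmenter (assumed input, not a real SAM call).
    semantic_map: integer array of shape (H, W) with per-pixel class ids,
        e.g. from a semantic segmentation model (assumed input).
    """
    labels = []
    for mask in masks:
        pixels = semantic_map[mask]
        pixels = pixels[pixels != ignore_label]  # drop unlabeled pixels
        if pixels.size == 0:
            labels.append(ignore_label)
            continue
        values, counts = np.unique(pixels, return_counts=True)
        # On a tie, argmax picks the smallest class id (np.unique sorts).
        labels.append(int(values[np.argmax(counts)]))
    return labels


def fuse_views(per_view_labels, ignore_label=-1):
    """Majority vote over the labels one object received in several views."""
    votes = np.asarray([l for l in per_view_labels if l != ignore_label])
    if votes.size == 0:
        return ignore_label
    values, counts = np.unique(votes, return_counts=True)
    return int(values[np.argmax(counts)])


if __name__ == "__main__":
    semantic_map = np.array([[1, 1, 2],
                             [1, 2, 2],
                             [3, 3, 3]])
    mask_a = np.array([[True, True, False],
                       [True, False, False],
                       [False, False, False]])
    mask_b = np.array([[False, False, True],
                       [False, True, True],
                       [False, False, True]])
    print(label_masks([mask_a, mask_b], semantic_map))
    print(fuse_views([5, 5, 7, -1]))
```

A majority vote is only one possible fusion rule. A real pipeline would also need to associate the same object across views, for example through camera poses or depth, which this sketch leaves out.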

Labeling Indoor Scenes with Fusion of Out-of-the-Box Perception Models

ADeLA: Automatic Dense Labeling with Attention for Viewpoint Shift in Semantic Segmentation

Semi-Supervised Learning for Visual Bird's Eye View Semantic Segmentation

LabelFormer: Object Trajectory Refinement for Offboard Perception from LiDAR Point Clouds

Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation

Automated Multimodal Data Annotation via Calibration With Indoor Positioning System

Visual Boundary-Guided Pseudo-Labeling for Weakly Supervised 3D Point Cloud Segmentation in Indoor Environments

Exploiting Unlabeled Data with Vision and Language Models for Object Detection

LABELMAKER: Automatic Semantic Label Generation from RGB-D Trajectories

Label-Efficient 3D Object Detection For Road-Side Units

Towards Label-free Scene Understanding by Vision Foundation Models

Detecting As Labeling: Rethinking LiDAR-camera Fusion in 3D Object Detection

Visual Foundation Models Boost Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation

Labeling 3D scenes for Personal Assistant Robots

Make a Strong Teacher with Label Assistance: A Novel Knowledge Distillation Approach for Semantic Segmentation

FM-Fusion: Instance-Aware Semantic Mapping Boosted by Vision-Language Foundation Models

Learning Semantic Segmentation on Unlabeled Real-World Indoor Point Clouds via Synthetic Data

An Empirical Study of Automated Mislabel Detection in Real World Vision Datasets

Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning

Joint Global and Dynamic Pseudo Labeling for Semi-Supervised Point Cloud Sequence Segmentation