Labeling Indoor Scenes with Fusion of Out-of-the-Box Perception Models

Yimeng Li, Navid Rajabi, Sulabh Shrestha, Md Alimoor Reza, Jana Kosecka
2023-11-18
Abstract: The image annotation stage is a critical and often the most time-consuming part required for training and evaluating object detection and semantic segmentation models. Deployment of the existing models in novel environments often requires detecting novel semantic classes not present in the training data. Furthermore, indoor scenes contain significant viewpoint variations, which need to be handled properly by trained perception models. We propose to leverage the recent advancements in state-of-the-art models for bottom-up segmentation (SAM), object detection (Detic), and semantic segmentation (MaskFormer), all trained on large-scale datasets. We aim to develop a cost-effective labeling approach to obtain pseudo-labels for semantic segmentation and object instance detection in indoor environments, with the ultimate goal of facilitating the training of lightweight models for various downstream tasks. We also propose a multi-view labeling fusion stage, which considers the setting where multiple views of the scenes are available and can be used to identify and rectify single-view inconsistencies. We demonstrate the effectiveness of the proposed approach on the Active Vision dataset and the ADE20K dataset. We evaluate the quality of our labeling process by comparing it with human annotations. Also, we demonstrate the effectiveness of the obtained labels in downstream tasks such as object goal navigation and part discovery. In the context of object goal navigation, we depict enhanced performance using this fusion approach compared to a zero-shot baseline that utilizes large monolithic vision-language pre-trained models.
Computer Vision and Pattern Recognition, Computation and Language, Robotics
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address several key issues in indoor scene image annotation:

1. **Annotation Bottleneck in Training and Evaluating Models**:
   - Image annotation is a crucial step in training and evaluating object detection and semantic segmentation models, but it is often very time-consuming. For example, annotating and validating the HM3D dataset requires over 14,200 hours of manual effort.
2. **Detection of New Categories in New Environments**:
   - When deploying existing models in new environments, it is often necessary to detect new semantic categories that did not appear in the training data. Significant viewpoint variations in indoor scenes pose higher demands on the trained perception models.
3. **Multi-View Consistency**:
   - There are various viewpoint changes in indoor scenes, and single-view annotations may produce inconsistencies. Therefore, a method is needed to identify and correct these single-view inconsistencies.

### Solution

To address the above issues, the authors propose a method that integrates existing state-of-the-art models (such as the bottom-up segmentation model SAM, object detection model Detic, and semantic segmentation model MaskFormer) to obtain pseudo-labels in indoor environments. Specifically:

1. **Single-View Annotation**:
   - Use Detic for foreground category detection, use MaskFormer for background category semantic segmentation, and combine SAM to generate high-quality segmentation masks.
2. **Multi-View Fusion**:
   - When multiple views of scene images are available, identify and correct single-view inconsistencies through a multi-view fusion stage to improve annotation quality.
3. **Downstream Task Evaluation**:
   - Evaluate the quality of the annotation results through experiments on the Active Vision dataset and ADE20K dataset, and demonstrate their effectiveness in downstream tasks such as object goal navigation and part discovery.

### Main Contributions

1. **Design of a Fusion Prediction Annotation Method**:
   - Integrate the prediction results of state-of-the-art semantic segmentation and object detection models to obtain semantic labels for class-agnostic image segmentation.
2. **Enhancement of Multi-View Annotation Consistency**:
   - Enhance single-view annotations through multi-view semantic consistency on the Active Vision dataset and compare them with manual annotation results.
3. **Validation of Effectiveness in Downstream Tasks**:
   - Demonstrate the effectiveness of the fusion segmentation results through object goal navigation and part discovery tasks, especially in zero-shot settings.

### Experimental Results

- **Annotation Quality Evaluation**:
  - Validate the superior performance of the proposed method in semantic segmentation and small object segmentation tasks by comparing it with manual annotations.
- **Downstream Task Performance**:
  - In the object goal navigation task, the proposed method outperforms zero-shot baseline methods based on large-scale multimodal vision-language pre-trained models.

In summary, this paper provides an efficient and high-quality indoor scene image annotation method by integrating existing state-of-the-art models, offering strong support for downstream tasks.
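
### Illustrative Sketch: Labeling Class-Agnostic Masks

The idea of giving semantic labels to class-agnostic segments can be illustrated with a small sketch. This is not the authors' implementation: the majority-vote rule, the `min_fraction` threshold, and the function name `label_masks_by_majority_vote` are all assumptions chosen for illustration, and the inputs are plain NumPy arrays standing in for SAM-style masks and a per-pixel class map merged from detector and semantic segmentation outputs.

```python
import numpy as np


def label_masks_by_majority_vote(masks, semantic_map, unlabeled=-1, min_fraction=0.5):
    """Assign one semantic label to each class-agnostic mask (hypothetical helper).

    masks        : list of boolean arrays of shape (H, W), e.g. SAM-style segments
    semantic_map : integer array of shape (H, W) with per-pixel class ids >= 0
    Returns an (H, W) integer array. Pixels not covered by a confidently
    labeled mask are set to `unlabeled`.
    """
    fused = np.full(semantic_map.shape, unlabeled, dtype=np.int64)
    # Process large masks first so that smaller masks overwrite them.
    for mask in sorted(masks, key=lambda m: int(m.sum()), reverse=True):
        labels_in_mask = semantic_map[mask]
        if labels_in_mask.size == 0:
            continue  # skip empty masks
        counts = np.bincount(labels_in_mask)
        winner = int(counts.argmax())
        # Keep the label only if it covers enough of the mask (assumed rule).
        if counts[winner] / labels_in_mask.size >= min_fraction:
            fused[mask] = winner
    return fused


# Toy 4x4 example: the per-pixel prediction is noisy at the segment boundary.
semantic_map = np.array([
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [1, 2, 2, 2],
    [3, 3, 3, 3],
])
mask_a = np.zeros((4, 4), dtype=bool)
mask_a[0:3, 0:2] = True  # left segment
mask_b = np.zeros((4, 4), dtype=bool)
mask_b[0:3, 2:4] = True  # right segment

fused = label_masks_by_majority_vote([mask_a, mask_b], semantic_map)
print(fused)
```

In this toy input, the pixel at row 2, column 1 is predicted as class 2 but lies inside the left segment, whose majority class is 1, so the fused map relabels it to 1. The bottom row is covered by no mask and stays at `-1`. A real pipeline would need a policy for such uncovered pixels and for overlapping masks. The choices made here (leave uncovered pixels unlabeled, let smaller masks win) are only one option.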