Enhancing Embodied Object Detection through Language-Image Pre-training and Implicit Object Memory

Nicolas Harvey Chapman, Feras Dayoub, Will Browne, Chris Lehnert

2024-02-06

Abstract: Deep learning and large-scale language-image training have produced image object detectors that generalise well to diverse environments and semantic classes. However, single-image object detectors trained on internet data are not optimally tailored for the embodied conditions inherent in robotics. Instead, robots must detect objects from complex multi-modal data streams involving depth, localisation and temporal correlation, a task termed embodied object detection. Paradigms such as Video Object Detection (VOD) and Semantic Mapping have been proposed to leverage such embodied data streams, but existing work fails to enhance performance using language-image training. In response, we investigate how an image object detector pre-trained using language-image data can be extended to perform embodied object detection. We propose a novel implicit object memory that uses projective geometry to aggregate the features of detected objects across long temporal horizons. The spatial and temporal information accumulated in memory is then used to enhance the image features of the base detector. When tested on embodied data streams sampled from diverse indoor scenes, our approach improves the base object detector by 3.09 mAP, outperforming alternative external memories designed for VOD and Semantic Mapping. Our method also shows a significant improvement of 16.90 mAP relative to baselines that perform embodied object detection without first training on language-image data, and is robust to the sensor noise and domain shift experienced in real-world deployment.
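The abstract does not give implementation details, so the following is only a minimal sketch of what "using projective geometry to aggregate the features of detected objects" could look like. It assumes a pinhole camera, a known camera-to-world pose, and a top-down grid memory indexed by world x and z; the function names, the grid layout, and the running-average update are all illustrative assumptions, not the authors' method.

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift every pixel of a depth image to 3D world points (pinhole model)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]                 # pixel rows (v) and columns (u)
    x = (u - K[0, 2]) * depth / K[0, 0]       # camera-frame x (right)
    y = (v - K[1, 2]) * depth / K[1, 1]       # camera-frame y (down)
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)
    return (pts_cam.reshape(-1, 4) @ cam_to_world.T)[:, :3]

def update_memory(memory, counts, feats, pts_world, obj_mask, cell_size=0.1):
    """Running-average the features of detected-object pixels into a grid.

    memory: (GH, GW, C) features, counts: (GH, GW) observation counts.
    Assumption: world x and z are the horizontal axes of the grid.
    """
    gh, gw, c = memory.shape
    ix = np.floor(pts_world[:, 0] / cell_size).astype(int) + gw // 2
    iz = np.floor(pts_world[:, 2] / cell_size).astype(int)
    valid = (ix >= 0) & (ix < gw) & (iz >= 0) & (iz < gh)
    valid &= obj_mask.reshape(-1)             # keep only detected-object pixels
    flat = iz[valid] * gw + ix[valid]

    sums = (memory * counts[..., None]).reshape(-1, c)   # undo previous average
    new_counts = counts.reshape(-1).astype(float)
    np.add.at(sums, flat, feats.reshape(-1, c)[valid])   # accumulate duplicates
    np.add.at(new_counts, flat, 1.0)
    new_memory = sums / np.maximum(new_counts, 1.0)[:, None]
    return new_memory.reshape(gh, gw, c), new_counts.reshape(gh, gw)

# Toy usage: a 2x2 depth image observed twice from the same (identity) pose.
K = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
depth = np.ones((2, 2))
pts = backproject(depth, K, np.eye(4))
mask = np.ones((2, 2), dtype=bool)
memory, counts = np.zeros((8, 8, 3)), np.zeros((8, 8))
memory, counts = update_memory(memory, counts, np.ones((2, 2, 3)), pts, mask, 0.5)
memory, counts = update_memory(memory, counts, 3 * np.ones((2, 2, 3)), pts, mask, 0.5)
```

Because each cell stores an average over every observation that projects into it, information persists across frames regardless of how long ago the object was last seen, which is one simple way to obtain a long temporal horizon. A real system would also need to mask invalid depth readings and handle pose noise.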

Robotics

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address the problem of object detection by robots in complex multimodal data streams. Specifically, single-image object detectors trained on internet data are not fully suited to the conditions in robotic operating environments. Robots need to detect objects from complex multimodal data streams that include depth, localization, and temporal correlations, a task referred to as "Embodied Object Detection."

#### Main Contributions:

1. **External Memory Augmentation Method**: A method is proposed to enhance the feature space of object detectors trained on internet-scale language-image data using external memory.
2. **Implicit Object Memory with Projective Geometry**: A method is proposed to maintain implicit external memory using projective geometry to capture long-term object dependencies.
3. **Detailed Evaluation**: The proposed methods are thoroughly evaluated on data streams collected by robots in indoor scenes, demonstrating superior performance in the task of embodied object detection and showing robustness to sensor noise and data domain changes.

Through the above methods, the paper demonstrates a significant performance improvement over baselines that perform embodied object detection without first conducting language-image pretraining.
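The first contribution, enhancing a detector's feature space from an external memory, can be illustrated with a small sketch. The summary does not say how memory and image features are combined, so the version below makes loud assumptions: the memory is a top-down grid indexed by world x and z, each pixel already has a known 3D world point, and fusion is plain channel concatenation. The names `read_memory` and `enhance_features` are hypothetical.

```python
import numpy as np

def read_memory(memory, pts_world, cell_size=1.0):
    """Gather one memory feature per 3D point; zeros where a point is off-grid."""
    gh, gw, c = memory.shape
    ix = np.floor(pts_world[:, 0] / cell_size).astype(int) + gw // 2
    iz = np.floor(pts_world[:, 2] / cell_size).astype(int)
    valid = (ix >= 0) & (ix < gw) & (iz >= 0) & (iz < gh)
    out = np.zeros((pts_world.shape[0], c), dtype=memory.dtype)
    out[valid] = memory[iz[valid], ix[valid]]
    return out

def enhance_features(image_feats, memory_feats):
    """Concatenate per-pixel memory features onto image features (assumed fusion)."""
    h, w, _ = image_feats.shape
    return np.concatenate([image_feats, memory_feats.reshape(h, w, -1)], axis=-1)

# Toy usage: a 1x2 feature map whose first pixel falls on a populated cell.
memory = np.zeros((4, 4, 2))
memory[1, 2] = [5.0, 7.0]
pts_world = np.array([[0.5, 0.0, 1.5],      # lands in cell (row 1, col 2)
                      [10.0, 0.0, 10.0]])   # outside the grid
image_feats = np.ones((1, 2, 3))
enhanced = enhance_features(image_feats, read_memory(memory, pts_world))
```

Concatenation keeps the pre-trained image features untouched and lets downstream layers decide how much to rely on memory; a learned fusion module is an equally plausible choice.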

Move to See Better: Self-Improving Embodied Object Detection

VMM: Viewpoint-based Memory Mechanism for Object Detection of Moving Sensors

OmDet: Large‐scale vision‐language multi‐dataset pre‐training with multimodal detection network

Deep Affordance-Grounded Sensorimotor Object Recognition

On-line object detection: a robotics challenge

Embodied Object Representation Learning and Recognition

Geometric-aware Pretraining for Vision-centric 3D Object Detection

Learning Task-Aware Language-Image Representation for Class-Incremental Object Detection

Deep Active Perception for Object Detection using Navigation Proposals

DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning

Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL

Integrally Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection

Automatically Prepare Training Data for YOLO Using Robotic In-Hand Observation and Synthesis

Helping Hands: An Object-Aware Ego-Centric Video Recognition Model

Object Detection in the Context of Mobile Augmented Reality

Embodied Language Grounding with 3D Visual Feature Representations

To Boost Zero-Shot Generalization for Embodied Reasoning With Vision-Language Pre-Training

Robots Autonomously Detecting People: A Multimodal Deep Contrastive Learning Method Robust to Intraclass Variations

Embodied Visual Recognition

Embodied vision for learning object representations