Enhancing Embodied Object Detection through Language-Image Pre-training and Implicit Object Memory

Nicolas Harvey Chapman, Feras Dayoub, Will Browne, Chris Lehnert
2024-02-06
Abstract: Deep learning and large-scale language-image training have produced image object detectors that generalise well to diverse environments and semantic classes. However, single-image object detectors trained on internet data are not optimally tailored for the embodied conditions inherent in robotics. Instead, robots must detect objects from complex multi-modal data streams involving depth, localisation and temporal correlation, a task termed embodied object detection. Paradigms such as Video Object Detection (VOD) and Semantic Mapping have been proposed to leverage such embodied data streams, but existing work fails to enhance performance using language-image training. In response, we investigate how an image object detector pre-trained using language-image data can be extended to perform embodied object detection. We propose a novel implicit object memory that uses projective geometry to aggregate the features of detected objects across long temporal horizons. The spatial and temporal information accumulated in memory is then used to enhance the image features of the base detector. When tested on embodied data streams sampled from diverse indoor scenes, our approach improves the base object detector by 3.09 mAP, outperforming alternative external memories designed for VOD and Semantic Mapping. Our method also shows a significant improvement of 16.90 mAP relative to baselines that perform embodied object detection without first training on language-image data, and is robust to sensor noise and domain shift experienced in real-world deployment.
Robotics
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses object detection by robots operating on complex multimodal data streams. Single-image object detectors trained on internet data are not well suited to the conditions a robot encounters. A robot must instead detect objects from data streams that combine depth, localization, and temporal correlation, a task the authors call "Embodied Object Detection."

#### Main Contributions:

1. **External Memory Augmentation Method**: The paper proposes using an external memory to enhance the feature space of an object detector trained on internet-scale language-image data.
2. **Implicit Object Memory with Projective Geometry**: The paper proposes an implicit external memory, maintained using projective geometry, that captures long-term object dependencies.
3. **Detailed Evaluation**: The proposed methods are evaluated on data streams collected by robots in indoor scenes. They outperform the compared approaches on embodied object detection and are robust to sensor noise and domain shift.

With these methods, the paper reports a 3.09 mAP improvement over the base object detector and a 16.90 mAP improvement over baselines that perform embodied object detection without first training on language-image data.
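
The summary does not describe how the implicit object memory is built, so the following is only a rough illustration of the general idea of using projective geometry to gather object features across frames. It is not the authors' implementation. Everything in it is an assumption made for the sketch: the top-down grid layout, the running-mean update, the planar pose `(x, y, yaw)`, the pinhole parameters `fx` and `cx`, and all names (`TopDownMemory`, `backproject`, `observe`).

```python
import math
from dataclasses import dataclass, field


@dataclass
class TopDownMemory:
    """Hypothetical 2D grid memory; each cell keeps a running-mean feature."""
    size: int = 64          # cells per side (assumed)
    cell_m: float = 0.25    # metres per cell (assumed)
    cells: dict = field(default_factory=dict)  # (row, col) -> (count, mean)

    def world_to_cell(self, xw, yw):
        # The world origin maps to the centre of the grid.
        half = self.size // 2
        col = math.floor(xw / self.cell_m) + half
        row = math.floor(yw / self.cell_m) + half
        if 0 <= row < self.size and 0 <= col < self.size:
            return row, col
        return None  # the point falls outside the memory

    def write(self, cell, feature):
        # Incremental mean over every feature written to this cell.
        count, mean = self.cells.get(cell, (0, [0.0] * len(feature)))
        count += 1
        mean = [m + (f - m) / count for m, f in zip(mean, feature)]
        self.cells[cell] = (count, mean)

    def read(self, cell):
        entry = self.cells.get(cell)
        return None if entry is None else entry[1]


def backproject(u, depth, fx, cx, pose):
    """Map a pixel column and its depth to a world (x, y) ground-plane point.

    Assumes a pinhole camera whose optical axis is the robot's heading, and a
    world frame with x forward and y to the left when yaw is zero.
    """
    px, py, yaw = pose
    x_cam = (u - cx) * depth / fx   # metres to the right of the optical axis
    z_cam = depth                   # metres along the optical axis
    xw = px + z_cam * math.cos(yaw) + x_cam * math.sin(yaw)
    yw = py + z_cam * math.sin(yaw) - x_cam * math.cos(yaw)
    return xw, yw


def observe(memory, u, depth, fx, cx, pose, feature):
    """Project one detection into the memory and aggregate its feature."""
    cell = memory.world_to_cell(*backproject(u, depth, fx, cx, pose))
    if cell is not None:
        memory.write(cell, feature)
    return cell


# The same object seen from two robot positions, one metre apart.
memory = TopDownMemory()
cell_a = observe(memory, u=320, depth=2.1, fx=500.0, cx=320.0,
                 pose=(0.0, 0.1, 0.0), feature=[1.0, 0.0])
cell_b = observe(memory, u=320, depth=1.1, fx=500.0, cx=320.0,
                 pose=(1.0, 0.1, 0.0), feature=[0.0, 1.0])
print(cell_a == cell_b, memory.read(cell_a))  # True [0.5, 0.5]
```

Because both observations back-project to the same world location, they land in the same cell and their features are averaged. In this toy setup the geometric projection is what lets information about an object persist across frames, even when the object's position in the image changes.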