Online Embedding of Multi-Scale CLIP Features into 3D Maps

Shun Taguchi, Hideki Deguchi
2024-03-27
Abstract: This study introduces a novel approach to online embedding of multi-scale CLIP (Contrastive Language-Image Pre-Training) features into 3D maps. By harnessing CLIP, this methodology surpasses the constraints of conventional vocabulary-limited methods and enables the incorporation of semantic information into the resultant maps. While recent approaches have explored the embedding of multi-modal features in maps, they often impose significant computational costs and are therefore impractical for exploring unfamiliar environments in real time. Our approach tackles these challenges by efficiently computing and embedding multi-scale CLIP features, thereby facilitating the exploration of unfamiliar environments through real-time map generation. Moreover, embedding CLIP features into the resultant maps makes offline retrieval via linguistic queries feasible. In essence, our approach simultaneously achieves real-time object search and mapping of unfamiliar environments. Additionally, we propose a zero-shot object-goal navigation system based on our mapping approach, and we validate its efficacy through object-goal navigation, offline object retrieval, and multi-object-goal navigation in both simulated environments and real robot experiments. The findings demonstrate that our method is not only faster than state-of-the-art mapping methods but also surpasses them in the success rate of object-goal navigation tasks.
Robotics, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks how to embed multi-scale CLIP (Contrastive Language-Image Pre-Training) features into 3D maps efficiently and in real time in complex, unknown environments, and how to support object retrieval and navigation from natural-language queries. It focuses on three issues:

1. **Vocabulary Limitation**: Traditional methods rely on a fixed vocabulary for object detection and semantic segmentation, so they cannot capture the diverse semantic information found in real-world environments.
2. **Computational Complexity**: Existing methods often incur high computational costs when embedding multi-modal features, which makes real-time use difficult, especially when exploring unfamiliar environments.
3. **Multi-scale Adaptability**: Single-scale feature embedding lacks flexibility and cannot serve queries at different scales. For example, a sink in a kitchen is a "sink" at a small scale but part of a "kitchen" at a larger scale.

To address these problems, the authors propose a method with three components:

- **Multi-scale CLIP Feature Embedding**: Use the CLIP model to extract multi-scale visual-language features and embed them efficiently into 3D maps.
- **Real-time Performance**: Run in real time and adapt to dynamic changes in the environment.
- **Zero-shot Navigation**: Build a zero-shot object-goal navigation system on top of the generated map that can handle open-vocabulary queries.

### Specific Problem Summary

1. **Vocabulary Limitation Problem**: By using the CLIP model, the method goes beyond a traditional fixed vocabulary and supports embedding open-vocabulary semantic information.
2. **Computational Efficiency Problem**: Through batch processing and multi-scale feature extraction, the method significantly reduces computational cost and achieves efficient real-time embedding.
3. **Multi-scale Adaptability Problem**: Through multi-scale feature embedding, the method can serve queries at different scales, which improves the practicality and flexibility of the map.

Together, these improvements let the method perform real-time object search and mapping in unknown environments, and the offline retrieval function adds further practical value to the resulting map.
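
The offline retrieval idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the map is stored as an array of 3D points with one feature vector per point, and it uses random vectors in place of real CLIP embeddings. The names `normalize` and `retrieve`, the array shapes, and the 512-dimensional feature size are all hypothetical choices made for the example. A query embedding is compared to every stored feature by cosine similarity, and the best-matching points are returned.

```python
import numpy as np

def normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def retrieve(points, features, query, k=3):
    """Return the k map points whose features best match the query embedding.

    points:   (N, 3) array of 3D positions (hypothetical map storage)
    features: (N, D) array of embedded features, one per point
    query:    (D,) embedding of the language query
    """
    sims = normalize(features) @ normalize(query)  # (N,) cosine similarities
    order = np.argsort(-sims)[:k]                  # indices of the k best matches
    return points[order], sims[order]

# Toy data: random vectors stand in for real CLIP embeddings (assumption).
rng = np.random.default_rng(0)
points = rng.uniform(-5.0, 5.0, size=(100, 3))
features = rng.normal(size=(100, 512))

# Pretend the text query is a slightly noisy copy of the feature at point 42.
query = features[42] + 0.1 * rng.normal(size=512)

top_points, top_sims = retrieve(points, features, query, k=3)
```

In a real system the rows of `features` would come from an image encoder applied to observations of the scene and `query` from the matching text encoder, so that similar concepts land close together in the shared embedding space.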