Abstract: In the last years, the research interest in visual navigation towards objects in indoor environments has grown significantly. This growth can be attributed to the recent availability of large navigation datasets in photo-realistic simulated environments, like Gibson and Matterport3D. However, the navigation tasks supported by these datasets are often restricted to the objects present in the environment at acquisition time. Also, they fail to account for the realistic scenario in which the target object is a user-specific instance that can be easily confused with similar objects and may be found in multiple locations within the environment. To address these limitations, we propose a new task denominated Personalized Instance-based Navigation (PIN), in which an embodied agent is tasked with locating and reaching a specific personal object by distinguishing it among multiple instances of the same category. The task is accompanied by PInNED, a dedicated new dataset composed of photo-realistic scenes augmented with additional 3D objects. In each episode, the target object is presented to the agent using two modalities: a set of visual reference images on a neutral background and manually annotated textual descriptions. Through comprehensive evaluations and analyses, we showcase the challenges of the PIN task as well as the performance and shortcomings of currently available methods designed for object-driven navigation, considering modular and end-to-end agents.

What problem does this paper attempt to address?

The paper addresses Personalized Instance-based Navigation (PIN): navigating to user-specific objects in photo-realistic indoor environments. Specifically, it targets agents that can find and reach a specific personal item in a complex indoor environment, rather than merely recognizing and navigating to any object of a general category. For example, the agent must distinguish among multiple items of the same category (such as several teddy bears) and reach the specific one the user specified.

### Main Problems and Challenges

1. **Limitations of Existing Datasets**:
   - Existing navigation datasets usually contain only the objects that were present when the environment was acquired.
   - These datasets do not account for the realistic scenario in which the target is a user-specific instance that can be confused with similar objects.
   - The target object may appear in multiple locations in the environment, which increases the complexity of the task.
2. **Difficulties in Personalized Instance Recognition**:
   - The agent must recognize a specific instance from reference images and text descriptions alone, without contextual information about its surroundings.
   - The agent must cope with multiple distractor objects of the same category, which may look very similar to the target.
3. **Processing of Multimodal Inputs**:
   - The agent needs to process visual references (such as RGB images) and text descriptions together to identify the target object accurately.
   - Effective mechanisms are needed to fuse and exploit these two forms of input.
### Solutions

To address these problems, the authors propose the following:

- **New Task Definition (PIN)**: Introduce the Personalized Instance-based Navigation task, which requires the agent to find a specific personal item from reference images and text descriptions, without relying on the surrounding environment.
- **New Dataset (PInNED)**: Construct a new dataset that contains 338 additional 3D objects. These objects can be placed in different environments and moved to different locations. Each instance comes with visual reference images and textual descriptions for training and evaluating the agent.
- **Benchmark Testing and Analysis**: Conduct an extensive evaluation of existing navigation agents, showing their performance and shortcomings on the PIN task, with a comparison between modular and end-to-end methods.

### Formula Representation

Examples of formulas involved in the paper:

- **Matching Score Calculation**:
  \[ S = \sum_{i=1}^{n} c_i \]
  where \(S\) is the matching score and \(c_i\) is the confidence score of the \(i\)-th matching keypoint.
- **Euclidean Distance Threshold**:
  \[ d(x_t, z) < 1\ \text{meter} \]
  where \(x_t\) is the position of the agent at the current time step and \(z\) is the target position.

Through these contributions, the authors hope to promote further research on personalized instance-based navigation, particularly with regard to its feasibility in real-world applications.
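
The two formulas above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `matching_score` and `within_success_radius` are hypothetical, the keypoint confidences are assumed to come from some unspecified keypoint matcher, and positions are assumed to be plain coordinate tuples in meters.

```python
import math
from typing import Sequence


def matching_score(confidences: Sequence[float]) -> float:
    """Matching score S = sum_i c_i over the matched keypoints.

    `confidences` holds one confidence value c_i per matched keypoint;
    how those values are produced is outside this sketch.
    """
    return float(sum(confidences))


def within_success_radius(
    agent_pos: Sequence[float],
    target_pos: Sequence[float],
    threshold_m: float = 1.0,
) -> bool:
    """Check the distance condition d(x_t, z) < threshold (1 meter by default).

    `agent_pos` is x_t and `target_pos` is z, both in the same
    coordinate frame and expressed in meters (an assumption here).
    """
    return math.dist(agent_pos, target_pos) < threshold_m


if __name__ == "__main__":
    # Three matched keypoints with illustrative confidence values.
    print(matching_score([0.5, 0.25, 0.25]))
    # Agent roughly 0.85 m from the target, so the condition holds.
    print(within_success_radius((0.0, 0.0, 0.0), (0.6, 0.0, 0.6)))
```

The inequality is strict, so an agent exactly 1 meter from the target does not satisfy the condition in this sketch.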

Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments

Image-based Navigation in Real-World Environments via Multiple Mid-level Representations: Fusion Models, Benchmark and Efficient Evaluation

Right Place, Right Time! Generalizing ObjectNav to Dynamic Environments with Portable Targets

3D-Aware Object Goal Navigation Via Simultaneous Exploration and Identification

Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input

Embodied Navigation at the Art Gallery

Out of the Box: Embodied Navigation in the Real World

Navigating to Objects Specified by Images

Inavigation: an Image Based Indoor Navigation System

NaVIP: An Image-Centric Indoor Navigation Solution for Visually Impaired People

Embodied Question Answering in Photorealistic Environments With Point Cloud Perception

Multi-Object Navigation with dynamically learned neural implicit representations

Instance-aware Exploration-Verification-Exploitation for Instance ImageGoal Navigation

NavigationNet: A Large-scale Interactive Indoor Navigation Dataset

Unsupervised Visual Odometry and Action Integration for PointGoal Navigation in Indoor Environment

Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments

Towards Target-Driven Visual Navigation in Indoor Scenes via Generative Imitation Learning

IN-Sight: Interactive Navigation through Sight

Multi-Object Navigation Using Potential Target Position Policy Function

PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning

PlaceNav: Topological Navigation through Place Recognition