MLLM-Search: A Zero-Shot Approach to Finding People using Multimodal Large Language Models

Angus Fung, Aaron Hao Tan, Haitong Wang, Beno Benhabib, Goldie Nejat

2024-11-28

Abstract: Robotic search of people in human-centered environments, including healthcare settings, is challenging as autonomous robots need to locate people without complete or any prior knowledge of their schedules, plans or locations. Furthermore, robots need to be able to adapt to real-time events that can influence a person's plan in an environment. In this paper, we present MLLM-Search, a novel zero-shot person search architecture that leverages multimodal large language models (MLLM) to address the mobile robot problem of searching for a person under event-driven scenarios with varying user schedules. Our approach introduces a novel visual prompting method to provide robots with spatial understanding of the environment by generating a spatially grounded waypoint map, representing navigable waypoints by a topological graph and regions by semantic labels. This is incorporated into an MLLM with a region planner that selects the next search region based on the semantic relevance to the search scenario, and a waypoint planner which generates a search path by considering the semantically relevant objects and the local spatial context through our unique spatial chain-of-thought prompting approach. Extensive 3D photorealistic experiments were conducted to validate the performance of MLLM-Search in searching for a person with a changing schedule in different environments. An ablation study was also conducted to validate the main design choices of MLLM-Search. Furthermore, a comparison study with state-of-the-art search methods demonstrated that MLLM-Search outperforms existing methods with respect to search efficiency. Real-world experiments with a mobile robot in a multi-room floor of a building showed that MLLM-Search was able to generalize to finding a person in a new unseen environment.

Robotics, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses how a robot can find a specific person in a human-centered environment with incomplete or no prior knowledge. Specifically, an autonomous robot needs to locate people under two conditions:

1. **Lack of prior knowledge**: The robot may not have access to the person's schedule, plans, or location.
2. **Impact of real-time events**: A person's plan may change because of unexpected events (such as weather changes or appointment delays), and the robot needs to adapt to these changes.

To solve these problems, the authors propose a new architecture named MLLM-Search, which uses multimodal large language models (MLLMs) for zero-shot person search. Its main features include:

- **Visual prompting method**: Generates a spatially grounded waypoint map, with navigable waypoints represented as a topological graph and regions marked by semantic labels, to help the robot understand the spatial layout of the environment.
- **Region planner and waypoint planner**: The region planner selects the next search region based on its semantic relevance to the search scenario. The waypoint planner generates a search path that takes into account semantically relevant objects and the local spatial context.

With these components, MLLM-Search can efficiently find the target person in a dynamic environment, even when the user's schedule is incomplete or unavailable.
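To make the waypoint map idea more concrete, here is a minimal sketch of one way such a structure could be stored: an undirected graph of waypoints, each tagged with a semantic region label. The class name `WaypointMap`, its fields, and the example waypoints are hypothetical and chosen only for illustration; the paper's actual representation may differ.

```python
from collections import defaultdict, deque


class WaypointMap:
    """Hypothetical topological graph of waypoints with semantic region labels."""

    def __init__(self):
        self.position = {}                 # waypoint id -> (x, y)
        self.region_of = {}                # waypoint id -> region label
        self.neighbors = defaultdict(set)  # waypoint id -> adjacent waypoint ids

    def add_waypoint(self, wid, xy, region):
        self.position[wid] = xy
        self.region_of[wid] = region

    def connect(self, a, b):
        # Edges are undirected: the robot can traverse them in both directions.
        self.neighbors[a].add(b)
        self.neighbors[b].add(a)

    def waypoints_in(self, region):
        """All waypoint ids whose label matches the given region."""
        return sorted(w for w, r in self.region_of.items() if r == region)

    def hop_path(self, start, goal):
        """Breadth-first search: fewest-edge path from start to goal, or None."""
        parent = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            for nxt in sorted(self.neighbors[node]):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        return None


# Made-up three-waypoint environment.
wmap = WaypointMap()
wmap.add_waypoint("w0", (0.0, 0.0), "hallway")
wmap.add_waypoint("w1", (2.0, 0.0), "hallway")
wmap.add_waypoint("w2", (2.0, 3.0), "kitchen")
wmap.connect("w0", "w1")
wmap.connect("w1", "w2")

kitchen_waypoints = wmap.waypoints_in("kitchen")
route = wmap.hop_path("w0", "w2")
```

Keeping the region label on each waypoint lets a region-level decision ("search the kitchen next") be translated directly into a set of candidate waypoints to visit.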
### Formula summary

- **Objective function**: Minimize the expected distance of the robot from the starting position \(p_0\) to the target position \(p_t\):
  \[ \min \mathbb{E}[d(p_0, p_t)] \]
- **Waypoint selection**: Select a safe point at least \(\sigma_{\text{min}}\) away from obstacles as a waypoint, where \(D(p)\) is the distance from a point \(p\) to the nearest obstacle in \(O\):
  \[ D(p)=\min_{o \in O}\|p - o\| \]
  \[ w_i=\arg\min_{p \in P}\|p - w_i\| \]
- **Scoring mechanism**: Combine the likelihood score \(s_L\), the proximity score \(s_P\), and the recently-visited score \(s_R\), and select the region with the highest total as the next search region:
  \[ r_{\text{next}}=\arg\max_{r \in R}(s_L + s_P + s_R) \]

Together, these formulas describe how MLLM-Search places safe waypoints and chooses which region to search next.
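The clearance function \(D(p)\) and the region-scoring rule above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the point-obstacle model, and all numeric values are assumptions, and the snippet does not show how the individual scores \(s_L\), \(s_P\), and \(s_R\) are produced.

```python
import math


def clearance(p, obstacles):
    """D(p): distance from point p to the nearest obstacle point."""
    return min(math.dist(p, o) for o in obstacles)


def safe_points(candidates, obstacles, sigma_min):
    """Keep only candidates at least sigma_min away from every obstacle."""
    return [p for p in candidates if clearance(p, obstacles) >= sigma_min]


def next_region(scores):
    """r_next: the region with the largest s_L + s_P + s_R."""
    return max(scores, key=lambda region: sum(scores[region]))


# Made-up obstacle points and candidate waypoint locations.
obstacles = [(0.0, 0.0), (4.0, 0.0)]
candidates = [(0.5, 0.0), (2.0, 0.0), (2.0, 3.0)]
safe = safe_points(candidates, obstacles, sigma_min=1.0)

# Hypothetical (s_L, s_P, s_R) triples per region.
scores = {
    "kitchen": (0.6, 0.3, 0.5),
    "office": (0.9, 0.2, 0.1),
    "lounge": (0.2, 0.9, 0.1),
}
chosen = next_region(scores)
```

In this toy example the candidate at `(0.5, 0.0)` is rejected because it lies only 0.5 from an obstacle, and the kitchen is chosen because its three scores have the largest sum, even though the office has the highest likelihood score on its own.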

Development of a Human-Robot Hybrid Intelligent System Based on Brain Teleoperation and Deep Learning SLAM

ChatNav: Leveraging LLM to Zero-shot Semantic Reasoning in Object Navigation

Interactive Natural Language-based Person Search

Language-Grounded Dynamic Scene Graphs for Interactive Object Search with Mobile Manipulation

Semantic Mechanical Search with Large Vision and Language Models

Automatic Object Searching and Behavior Learning for Mobile Robots in Unstructured Environment by Deep Belief Networks

VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation

Large Language Models as Zero-Shot Human Models for Human-Robot Interaction

L3MVN: Leveraging Large Language Models for Visual Target Navigation

Semantic Linking Maps for Active Visual Object Search

SRLM: Human-in-Loop Interactive Social Robot Navigation with Large Language Model and Deep Reinforcement Learning

A Multirobot Person Search System for Finding Multiple Dynamic Users in Human-Centered Environments

Preferential Multi-Target Search in Indoor Environments using Semantic SLAM

VLN-Game: Vision-Language Equilibrium Search for Zero-Shot Semantic Navigation

GG-LLM: Geometrically Grounding Large Language Models for Zero-shot Human Activity Forecasting in Human-Aware Task Planning

Enhancing Socially-Aware Robot Navigation through Bidirectional Natural Language Conversation

LLM A*: Human in the Loop Large Language Models Enabled A* Search for Robotics

Open-vocabulary Mobile Manipulation in Unseen Dynamic Environments with 3D Semantic Maps
