Abstract: As the number of service robots and autonomous vehicles in human-centered environments grows, their requirements go beyond simply navigating to a destination. They must also take into account dynamic social contexts and ensure respect and comfort for others in shared spaces, which poses significant challenges for perception and planning. In this paper, we present a group-based social navigation framework GSON to enable mobile robots to perceive and exploit the social group of their surroundings by leveraging the visual reasoning capability of the Large Multimodal Model (LMM). For perception, we apply visual prompting techniques to zero-shot extract the social relationship among pedestrians and combine the result with a robust pedestrian detection and tracking pipeline to alleviate the problem of low inference speed of the LMM. Given the perception result, the planning system is designed to avoid disrupting the current social structure. We adopt a social structure-based mid-level planner as a bridge between global path planning and local motion planning to preserve the global context and reactive response. The proposed method is validated on real-world mobile robot navigation tasks involving complex social structure understanding and reasoning. Experimental results demonstrate the effectiveness of the system in these scenarios compared with several baselines.
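
The abstract pairs a slow LMM with a fast detection and tracking pipeline. One way to picture that pairing is a small cache that stores the most recent group assignment per track ID, so the tracker can keep answering group queries between LMM calls. This is a minimal sketch under stated assumptions, not the authors' implementation: the `GroupCache` class, its method names, and the `max_age` value are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GroupCache:
    """Holds the latest LMM grouping so a fast tracker can reuse it (hypothetical helper)."""

    max_age: float = 2.0  # seconds a grouping stays valid; an assumed value
    labels: dict = field(default_factory=dict)  # track_id -> (group_id, stamp)

    def update_from_lmm(self, groups, stamp):
        """Replace the cached grouping; `groups` is a list of lists of track IDs."""
        self.labels = {
            track_id: (group_id, stamp)
            for group_id, members in enumerate(groups)
            for track_id in members
        }

    def groups_at(self, tracked_ids, now):
        """Return {group_id: [track_id, ...]} for currently tracked IDs with a fresh label."""
        result = {}
        for track_id in tracked_ids:
            entry = self.labels.get(track_id)
            if entry is None or now - entry[1] > self.max_age:
                continue  # unknown or stale label: leave this pedestrian ungrouped
            result.setdefault(entry[0], []).append(track_id)
        return result
```

For example, after `cache.update_from_lmm([[1, 2], [3]], stamp=0.0)`, a tracker frame at `now=1.0` that sees tracks 1 to 4 would get tracks 1 and 2 in one group, track 3 in another, and track 4 ungrouped until the next LMM result arrives.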

What problem does this paper attempt to address?

The problem this paper attempts to address is that in human-centered environments, mobile robots and service robots need not only to avoid obstacles while navigating but also to understand and respect complex social structures to ensure they do not interfere with social interactions among people. Specifically, the paper focuses on how to enable robots to perceive and utilize social group information in their surroundings and, based on this, perform reasonable path planning and motion control.

### Background and Problem

As the number of service robots and autonomous vehicles in human-centered environments increases, their requirements have surpassed merely reaching a destination. These robots must consider dynamic social contexts to ensure they respect others' comfort in shared spaces, posing significant challenges for perception and planning. Traditional methods often rely on predefined rules or domain-specific training data, which struggle to handle the complexity and diversity of open-world environments. Therefore, this paper proposes a Group-based Social Navigation framework (GSON), aiming to enable mobile robots to perceive and utilize social group information in their surroundings through the visual reasoning capabilities of large multimodal models (LMM).

### Solution

1. **Perception Module**:
   - **Pedestrian Detection and Tracking**: Combining 2D LiDAR and RGB camera data to achieve robust pedestrian detection and tracking.
   - **Social Group Detection**: Utilizing the zero-shot visual reasoning capabilities of LMM to extract social relationships between pedestrians from images and, combined with pedestrian detection and tracking results, generate social group estimates.
2. **Planning Module**:
   - **Global Path Planning**: Generating a global reference path.
   - **Mid-level Planning**: Serving as a bridge between global path planning and local motion planning, using social group estimates to generate intermediate reference paths that guide the local planner to avoid interfering with social groups.
   - **Local Motion Planning**: Combining Model Predictive Control (MPC) and Control Barrier Functions (CBF) to generate safe trajectories, ensuring the robot responds in real-time in dynamic environments.

### Experimental Validation

The paper validates the effectiveness of GSON through both simulation and real-world experiments. The experiments include various daily social scenarios such as queuing, conversing, and taking photos. The results show that GSON outperforms baseline methods in reducing interference time with individuals and groups and maintaining a higher comfort distance, demonstrating its capability to navigate in complex social environments.

### Conclusion

This paper proposes a novel approach that combines the visual reasoning capabilities of large multimodal models with a social structure perception and planning system, enabling mobile robots to perform socially aware navigation in human-centered environments. Extensive experiments demonstrate that this method outperforms existing methods in terms of performance and robustness, showcasing its potential to enhance the social awareness and interaction capabilities of autonomous robots. Future work can further expand this framework to more densely interactive social environments and explore distilling knowledge into smaller models to accelerate inference speed.
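
The mid-level planning idea described under Solution, choosing intermediate paths that do not interfere with social groups, can be illustrated with a simple geometric check. The summary does not say how GSON represents the space a group occupies, so the sketch below substitutes a deliberately simple assumption: a path disrupts a group if any of its segments crosses the straight line between two members of that group. The function names (`segments_cross`, `path_splits_group`, `pick_path`) are hypothetical and do not come from the paper.

```python
from itertools import combinations


def _orient(a, b, c):
    """Signed area of triangle abc: > 0 counter-clockwise, < 0 clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(p1, p2, q1, q2):
    """True if segment p1-p2 properly crosses segment q1-q2."""
    d1 = _orient(q1, q2, p1)
    d2 = _orient(q1, q2, p2)
    d3 = _orient(p1, p2, q1)
    d4 = _orient(p1, p2, q2)
    # Each segment's endpoints must lie strictly on opposite sides of the other.
    return d1 * d2 < 0 and d3 * d4 < 0


def path_splits_group(path, group):
    """True if any path segment passes between two members of the group."""
    for a, b in zip(path, path[1:]):
        for m1, m2 in combinations(group, 2):
            if segments_cross(a, b, m1, m2):
                return True
    return False


def pick_path(candidates, groups):
    """Return the first candidate path that splits no group, else None."""
    for path in candidates:
        if not any(path_splits_group(path, g) for g in groups):
            return path
    return None
```

With two people conversing at `(0, 1)` and `(0, -1)`, the straight path `[(-2, 0), (2, 0)]` is rejected because it passes between them, while a detour such as `[(-2, 0), (-1, 2), (1, 2), (2, 0)]` is accepted. A real planner would also need a comfort margin around each group and would hand the chosen path to the local MPC and CBF layer, neither of which is modeled here.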

GSON: A Group-based Social Navigation Framework with Large Multimodal Model

ChatNav: Leveraging LLM to Zero-shot Semantic Reasoning in Object Navigation

Mapless Collaborative Navigation for a Multi-Robot System Based on the Deep Reinforcement Learning

ColAG: A Collaborative Air-Ground Framework for Perception-Limited UGVs' Navigation

CrowdMove: Autonomous Mapless Navigation in Crowded Scenarios

Following Social Groups: Socially Compliant Autonomous Navigation in Dense Crowds

Extracting Dynamic Navigation Goal from Natural Language Dialogue

Enhancing Socially-Aware Robot Navigation through Bidirectional Natural Language Conversation

Co-NavGPT: Multi-Robot Cooperative Visual Semantic Navigation using Large Language Models

Rethinking Social Robot Navigation: Leveraging the Best of Two Worlds

Social navigation framework for assistive robots in human inhabited unknown environments

SG-LSTM: Social Group LSTM for Robot Navigation Through Dense Crowds

A Study on Learning Social Robot Navigation with Multimodal Perception

VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models

SACSoN: Scalable Autonomous Control for Social Navigation

CAMON: Cooperative Agents for Multi-Object Navigation with LLM-based Conversations

Toward Human-Like Social Robot Navigation: A Large-Scale, Multi-Modal, Social Human Navigation Dataset

Efficient Collaborative Navigation Through Perception Fusion for Multi-Robots in Unknown Environments

Learning World Transition Model for Socially Aware Robot Navigation

Language and Sketching: An LLM-driven Interactive Multimodal Multitask Robot Navigation Framework