Abstract: Navigating dynamic urban environments presents significant challenges for embodied agents, requiring advanced spatial reasoning and adherence to common-sense norms. Despite progress, existing visual navigation methods struggle in map-free or off-street settings, limiting the deployment of autonomous agents like last-mile delivery robots. To overcome these obstacles, we propose a scalable, data-driven approach for human-like urban navigation by training agents on thousands of hours of in-the-wild city walking and driving videos sourced from the web. We introduce a simple and scalable data processing pipeline that extracts action supervision from these videos, enabling large-scale imitation learning without costly annotations. Our model learns sophisticated navigation policies to handle diverse challenges and critical scenarios. Experimental results show that training on large-scale, diverse datasets significantly enhances navigation performance, surpassing current methods. This work shows the potential of using abundant online video data to develop robust navigation policies for embodied agents in dynamic urban settings. https://ai4ce.github.io/CityWalker/

What problem does this paper attempt to address?

The problem this paper addresses is how to make embodied agents navigate dynamic urban environments as efficiently and safely as humans do. Specifically, existing visual navigation methods perform poorly without maps or in off-street environments, which limits the practical deployment of autonomous agents (such as last-mile delivery robots). To overcome these obstacles, the authors propose a scalable, data-driven method: training a model on thousands of hours of online city walking and driving videos to achieve human-like urban navigation.

### Problem Background

1. **Challenges of Embodied Urban Navigation**:
   - Dynamic urban environments pose significant challenges to embodied agents, requiring advanced spatial reasoning and compliance with common-sense norms.
   - Existing visual navigation methods perform poorly in map-free or off-street settings, limiting their scope of application.
   - Applications such as autonomous robots and driverless taxis require agents to navigate efficiently in complex, changing urban environments.
2. **Limitations of Existing Methods**:
   - Reinforcement learning and imitation learning methods usually perform well in static or controlled environments, but have difficulty with the complex rules and social norms of real-world urban navigation.
   - Collecting expert trajectory data limits both the amount and the diversity of data, which weakens the agent's ability to generalize across urban scenarios.

### Proposed Solution

The authors propose a framework named CityWalker, which is trained on large-scale online city walking and driving videos. The main contributions of this method include:

1. **Scalable Data Processing Pipeline**:
   - Extracting action supervision from videos makes large-scale imitation learning possible without expensive manual annotation.
   - Off-the-shelf visual odometry (VO) models generate pseudo-labels that are noisy but cheap to produce, reducing annotation cost and improving scalability.
2. **Improved Navigation Performance**:
   - Experimental results show that training on large-scale, diverse datasets significantly improves navigation performance, surpassing existing methods.
   - The model can handle multiple challenges and critical scenarios, such as obstacle avoidance and traffic light recognition.
3. **Cross-Domain and Cross-Embodiment Generalization**:
   - The trained model is not limited to walking data: it can also be applied to other embodiments such as quadruped robots, and it performs better in cross-domain settings that combine city walking and driving data.

### Summary

CityWalker aims to train embodied agents to navigate complex, changing urban environments efficiently and safely by using large-scale online video data. The research demonstrates the potential of abundant online video data for developing robust navigation policies, addressing the shortcomings of existing methods in dynamic urban navigation.
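
The pipeline step that turns visual odometry output into action supervision can be illustrated with a minimal sketch. The summary does not state the label format, so everything here is an assumption: poses are taken to be planar `(x, y, yaw)` tuples from some VO model, and the hypothetical helper `waypoints_in_ego_frame` expresses the next few positions relative to the current camera pose, which is one plausible form of pseudo-label for imitation learning. Scale ambiguity and VO noise are not handled.

```python
import math
from typing import List, Tuple

# Assumed pose format: (x, y, yaw) in an arbitrary VO world frame.
Pose = Tuple[float, float, float]


def waypoints_in_ego_frame(
    poses: List[Pose], t: int, horizon: int
) -> List[Tuple[float, float]]:
    """Express the next `horizon` positions relative to the pose at index t.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    x0, y0, yaw0 = poses[t]
    cos_y, sin_y = math.cos(yaw0), math.sin(yaw0)
    labels = []
    for x, y, _ in poses[t + 1 : t + 1 + horizon]:
        dx, dy = x - x0, y - y0
        # Rotate the world-frame offset by -yaw0 so that +x points forward.
        labels.append((cos_y * dx + sin_y * dy, -sin_y * dx + cos_y * dy))
    return labels


if __name__ == "__main__":
    # A walker heading along world +y (yaw = 90 degrees), one unit per frame.
    track = [(0.0, float(i), math.pi / 2) for i in range(4)]
    print(waypoints_in_ego_frame(track, t=0, horizon=2))
```

In this toy track the walker moves straight ahead, so the ego-frame waypoints lie along the forward axis with near-zero lateral offset. A real pipeline would likely also need to normalize or filter the labels, since monocular VO is only defined up to scale.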

CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

CrowdMove: Autonomous Mapless Navigation in Crowded Scenarios

Unsupervised Reinforcement Learning of Transferable Meta-Skills for Embodied Navigation

Demo Abstract: Embodied Aerial Agent for City-level Visual Language Navigation Using Large Language Model

Traversability-Aware Legged Navigation by Learning from Real-World Visual Data

NavAgent: Multi-scale Urban Street View Fusion For UAV Embodied Vision-and-Language Navigation

X-MOBILITY: End-To-End Generalizable Navigation via World Modeling

Deep Understanding of Urban Mobility from CityscapeWebcams

Learning to Navigate Sidewalks in Outdoor Environments

Learning Semantic Traversability with Egocentric Video and Automated Annotation Strategy

Learning Deployable Navigation Policies at Kilometer Scale from a Single Traversal

Toward Human-Like Social Robot Navigation: A Large-Scale, Multi-Modal, Social Human Navigation Dataset

Deep Learning for Embodied Vision Navigation: A Survey

Enhancing Socially-Aware Robot Navigation through Bidirectional Natural Language Conversation

Learning Exploration Policies for Navigation

Language-guided Robust Navigation for Mobile Robots in Dynamically-changing Environments

BEVNav: Robot Autonomous Navigation Via Spatial-Temporal Contrastive Learning in Bird's-Eye View

L3MVN: Leveraging Large Language Models for Visual Target Navigation

Autonomous social robot navigation in unknown urban environments using semantic segmentation

Vision Based Sidewalk Navigation for Last-mile Delivery Robot

See What the Robot Can't See: Learning Cooperative Perception for Visual Navigation