CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

Xinhao Liu, Jintong Li, Yichen Jiang, Niranjan Sujay, Zhicheng Yang, Juexiao Zhang, John Abanes, Jing Zhang, Chen Feng
2024-11-27
Abstract: Navigating dynamic urban environments presents significant challenges for embodied agents, requiring advanced spatial reasoning and adherence to common-sense norms. Despite progress, existing visual navigation methods struggle in map-free or off-street settings, limiting the deployment of autonomous agents like last-mile delivery robots. To overcome these obstacles, we propose a scalable, data-driven approach for human-like urban navigation by training agents on thousands of hours of in-the-wild city walking and driving videos sourced from the web. We introduce a simple and scalable data processing pipeline that extracts action supervision from these videos, enabling large-scale imitation learning without costly annotations. Our model learns sophisticated navigation policies to handle diverse challenges and critical scenarios. Experimental results show that training on large-scale, diverse datasets significantly enhances navigation performance, surpassing current methods. This work shows the potential of using abundant online video data to develop robust navigation policies for embodied agents in dynamic urban settings. Project page: https://ai4ce.github.io/CityWalker/
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
This paper asks how embodied agents can navigate dynamic urban environments as efficiently and safely as humans do. Existing visual navigation methods perform poorly without maps or in off-street settings, which limits the practical deployment of autonomous agents such as last-mile delivery robots. To overcome these obstacles, the authors propose a scalable, data-driven method: a model trained on thousands of hours of online city walking and driving videos learns human-like urban navigation.

### Problem Background

1. **Challenges of Embodied Urban Navigation**:
   - Dynamic urban environments pose significant challenges to embodied agents, requiring advanced spatial reasoning and compliance with common-sense norms.
   - Existing visual navigation methods perform poorly in map-free or off-street settings, which limits their scope of application.
   - Applications such as self-driving robots and unmanned taxis require agents to navigate efficiently in complex, changing urban environments.
2. **Limitations of Existing Methods**:
   - Reinforcement learning and imitation learning methods usually perform well in static or controlled environments, but have difficulty with the complex rules and social norms of real-world urban navigation.
   - Collecting expert trajectories limits both the amount and the diversity of training data, which weakens the agent's ability to generalize across urban scenarios.

### Proposed Solution

The authors propose a framework named CityWalker, which is trained on large-scale online city walking and driving videos. The main contributions of this method include:

1. **Scalable Data Processing Pipeline**:
   - Extracting action supervision from videos makes large-scale imitation learning possible without expensive manual annotation.
   - Off-the-shelf visual odometry (VO) models generate pseudo-labels that are noisy but cheap, which reduces annotation cost and improves scalability.
2. **Improved Navigation Performance**:
   - Experimental results show that training on large-scale, diverse datasets significantly improves navigation performance, surpassing existing methods.
   - The model handles a range of challenges and critical scenarios, such as obstacle avoidance and traffic light recognition.
3. **Cross-Domain and Cross-Embodiment Generalization**:
   - The trained model is not limited to walking data: it can also be applied to other embodiments such as quadruped robots, and it performs better in cross-domain settings that combine urban walking and driving data.

### Summary

CityWalker aims to train embodied agents to navigate efficiently and safely in complex, changing urban environments by using large-scale online video data. The work demonstrates the potential of abundant online video for developing robust navigation policies, addressing the shortcomings of existing methods in dynamic urban navigation.
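The pseudo-labeling step of the data processing pipeline described above can be illustrated with a small sketch. It assumes that a VO model has already produced a planar pose (x, y, yaw) for every video frame, and it turns those poses into future-waypoint targets expressed in the agent's own frame, which is one common form of action supervision for imitation learning. The function names `relative_waypoints` and `build_samples`, the 2D pose simplification, and the `horizon` parameter are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def relative_waypoints(positions, yaws, t, horizon):
    """Return the next `horizon` positions expressed in the agent frame at step t.

    positions: (N, 2) array of x, y positions in the VO world frame (assumed input).
    yaws:      (N,) array of headings in radians, measured in the same frame.
    """
    c, s = np.cos(yaws[t]), np.sin(yaws[t])
    # Rotation by -yaw: maps world-frame offsets into the heading-aligned frame,
    # so that "straight ahead" is always the +x axis.
    world_to_agent = np.array([[c, s], [-s, c]])
    deltas = positions[t + 1 : t + 1 + horizon] - positions[t]
    return deltas @ world_to_agent.T


def build_samples(positions, yaws, horizon):
    """Pair every frame index that has a full horizon ahead of it with its targets."""
    last_start = len(positions) - horizon
    return [
        (t, relative_waypoints(positions, yaws, t, horizon))
        for t in range(last_start)
    ]
```

For a trajectory that moves along the world +y axis while facing +y (yaw of π/2), every target lies on the agent's forward axis, so the labels do not depend on the arbitrary world frame that VO happens to choose. Monocular VO recovers translation only up to scale, so a real pipeline would also need some form of scale normalization, which this sketch leaves out.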