Abstract: While there has been remarkable progress recently in the fields of manipulation and locomotion, mobile manipulation remains a long-standing challenge. Compared to locomotion or static manipulation, a mobile system must make a diverse range of long-horizon tasks feasible in unstructured and dynamic environments. While the applications are broad and interesting, there is a plethora of challenges in developing these systems, such as coordination between the base and arm, reliance on onboard perception for perceiving and interacting with the environment, and, most importantly, integrating all these parts simultaneously. Prior works approach the problem using disentangled modular skills for mobility and manipulation that are trivially tied together. This causes several limitations, such as compounding errors, delays in decision-making, and no whole-body coordination. In this work, we present a reactive mobile manipulation framework that uses an active visual system to consciously perceive and react to its environment. Similar to how humans leverage whole-body and hand-eye coordination, we develop a mobile manipulator that exploits its ability to move and see, more specifically -- to move in order to see and to see in order to move. This allows it not only to move around and interact with its environment but also to choose "when" to perceive "what" using an active visual system. We observe that such an agent learns to navigate around complex cluttered scenarios while displaying agile whole-body coordination using only ego-vision, without needing to create environment maps. Results visualizations and videos at

What problem does this paper attempt to address?

The paper addresses the problem of achieving simultaneous perception, interaction, and navigation (SPIN) with a mobile manipulator in cluttered and unstructured environments. Specifically, it focuses on enabling a robot to perform a range of long-horizon tasks in unstructured, dynamic environments, such as navigating through complex obstacles and grasping different objects. Traditional methods for mobility and manipulation often treat perception, planning, and obstacle avoidance as separate processes, which leads to delays in decision-making, compounding errors, and a lack of whole-body coordination. The paper proposes a framework that trains a single model through reinforcement learning, allowing the robot to control its base and arm simultaneously and to actively choose when and what to perceive, so that it can navigate and manipulate efficiently in complex environments.

The main contributions of the paper include:

1. **Proposing a new mobile manipulation framework**: The framework uses an active vision system, enabling the robot to consciously perceive and respond to its environment.
2. **Achieving whole-body coordination**: The robot uses its mobility and vision together, not only to navigate around obstacles but also to choose "when" to perceive "what".
3. **Demonstrating performance in complex environments**: Experimental results show that the method outperforms traditional methods in both simulated and real-world environments, particularly when handling dynamic obstacles.

Through these contributions, the paper aims to address the challenges that mobile manipulation systems face in unstructured and dynamic environments.
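To make the idea of a single model that commands the base, the arm, and the camera at once more concrete, here is a minimal sketch. It is not the paper's architecture: the class `UnifiedPolicy`, the helper `split_action`, the linear-plus-tanh mapping, and every dimension are hypothetical choices made purely for illustration, and the reinforcement learning training loop is omitted entirely.

```python
import numpy as np

# Hypothetical dimensions, chosen only for illustration (not from the paper).
BASE_DIM = 2     # e.g. linear and angular base velocity
ARM_DIM = 6      # e.g. arm joint commands
CAMERA_DIM = 2   # e.g. camera pan and tilt (the "active vision" part)
OBS_DIM = 32     # size of an assumed ego-centric observation embedding
ACTION_DIM = BASE_DIM + ARM_DIM + CAMERA_DIM


class UnifiedPolicy:
    """Toy linear policy: one weight matrix produces every command at once."""

    def __init__(self, seed: int = 0) -> None:
        rng = np.random.default_rng(seed)
        # Untrained random weights; a real system would learn these.
        self.weights = rng.normal(scale=0.1, size=(ACTION_DIM, OBS_DIM))

    def act(self, observation: np.ndarray) -> np.ndarray:
        # tanh squashes every command into the range [-1, 1].
        return np.tanh(self.weights @ observation)


def split_action(action: np.ndarray) -> dict:
    """Slice the single action vector into per-subsystem commands."""
    base, arm, camera = np.split(action, [BASE_DIM, BASE_DIM + ARM_DIM])
    return {"base": base, "arm": arm, "camera": camera}


if __name__ == "__main__":
    policy = UnifiedPolicy()
    observation = np.ones(OBS_DIM)  # placeholder observation
    commands = split_action(policy.act(observation))
    for name, values in commands.items():
        print(name, values.shape)
```

The point of the sketch is only the interface: because all three command groups come from one forward pass over the same observation, there is no hand-off between a separate navigation module and a separate manipulation module, which is the kind of coupling the summary above contrasts with modular pipelines.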

SPIN: Simultaneous Perception, Interaction and Navigation

Dexterous Manoeuvre Through Touch in a Cluttered Scene

Situated Multimodal Control of a Mobile Robot: Navigation through a Virtual Environment

Interactive Navigation with Adaptive Non-prehensile Mobile Manipulation

Interactive Perception for Deformable Object Manipulation

Active-Perceptive Motion Generation for Mobile Manipulation

Active Perception and Representation for Robotic Manipulation

Learning Mobile Manipulation

Mobile Manipulation Leveraging Multiple Views

Language-guided Semantic Mapping and Mobile Manipulation in Partially Observable Environments

A Holistic Approach to Reactive Mobile Manipulation

N$^2$M$^2$: Learning Navigation for Arbitrary Mobile Manipulation Motions in Unseen and Dynamic Environments

MoDem-V2: Visuo-Motor World Models for Real-World Robot Manipulation

Multi-sensor Fusion for Interactive Visual Computing in Mixed Environment

An Architecture for Reactive Mobile Manipulation On-The-Move

A Mobile Manipulation System for One-Shot Teaching of Complex Tasks in Homes

Hierarchical visuomotor control of humanoids

Dynamic Planning for Sequential Whole-body Mobile Manipulation

Predictive Multi-Agent-Based Planning and Landing Controller for Reactive Dual-Arm Manipulation

RGBManip: Monocular Image-based Robotic Manipulation through Active Object Pose Estimation

Open-vocabulary Mobile Manipulation in Unseen Dynamic Environments with 3D Semantic Maps