Abstract: We report improvements to the NeurIPS 2023 HomeRobot: Open Vocabulary Mobile Manipulation (OVMM) Challenge reinforcement learning baseline. More specifically, we propose a more accurate semantic segmentation module, a better place skill policy, and a high-level heuristic that together outperform the baseline by 2.4 percentage points in overall success rate (a sevenfold improvement) and 8.2 percentage points in partial success rate (a 1.75-fold improvement) on the Test Standard split of the challenge dataset. With these enhancements incorporated, our agent placed 3rd in the challenge in both the simulation and real-world stages.

What problem does this paper attempt to address?

The paper addresses open-vocabulary mobile manipulation (OVMM) in a home environment. Specifically, the research team aims to improve the robot's ability to navigate unknown environments, find specified objects, and move them to specified target containers. The paper focuses on improving the reinforcement-learning baseline, raising the overall and partial success rates by enhancing the semantic segmentation module, optimizing the place skill policy, and introducing a high-level heuristic.

### Main Contributions

1. **Improvement of the Semantic Segmentation Module**:
   - A retrained YOLOv8 object detection model and the MobileSAM segmentation model improve the agent's understanding of the environment.
   - Combined with the Detic perception module, they produce more accurate semantic segmentation masks, especially for small objects and furniture types.
2. **Optimization of the Place Skill Policy**:
   - Analyzing the baseline's place skill performance identifies the existing bottlenecks.
   - The reward function is adjusted to better guide the agent when placing objects, reducing unstable placements and cases of missing the target container.
3. **Introduction of a High-Level Heuristic**:
   - A more complex high-level policy uses conditional loops to ensure that subsequent tasks are not attempted before the object has been grasped successfully.
   - The success rates of navigation and placement improve, especially in terms of partial success.

### Experimental Results

- On the Test Standard split, the improved agent achieves a 7-fold increase in overall success rate (from 0.4% to 2.8%) and a 1.75-fold increase in partial success rate (from 10.9% to 19.1%).
- The agent placed third in both the simulation and real-world stages of the challenge.
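The conditional loop described under the high-level heuristic contribution can be illustrated with a minimal sketch. This is not the authors' implementation: the skill interface (callables returning `True` on success), the stage names, and the `max_pick_attempts` retry budget are all assumptions made for illustration only.

```python
from typing import Callable, List

# Hypothetical skill interface: each skill runs to completion and reports success.
Skill = Callable[[], bool]


def run_episode(
    find_object: Skill,
    pick: Skill,
    find_container: Skill,
    place: Skill,
    max_pick_attempts: int = 3,  # assumed retry budget, not taken from the paper
) -> List[str]:
    """Return the names of the stages that completed, in order."""
    completed: List[str] = []
    grasped = False

    # Conditional loop: retry search + pick until the grasp succeeds
    # or the retry budget runs out.
    for _ in range(max_pick_attempts):
        if not find_object():
            continue
        if "find_object" not in completed:
            completed.append("find_object")
        if pick():
            completed.append("pick")
            grasped = True
            break

    # Gate: never navigate to the container or place without a grasped object.
    if not grasped:
        return completed

    if find_container():
        completed.append("find_container")
        if place():
            completed.append("place")
    return completed
```

In this sketch, a pick skill that fails twice and then succeeds still reaches all four stages, whereas a pick skill that never succeeds ends the episode without the place skill ever being called. The returned list of completed stages loosely mirrors the idea of partial success.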
### Future Work

- **Object Tracker**: Introducing an object tracker would prevent objects from disappearing between consecutive frames and improve the agent's stability during navigation and manipulation.
- **Improvement of Policies and Skills**: Further optimizing the training of the high-level policy and the individual skills would improve the agent's overall performance.
- **World Representation**: Building a world representation of the environment that stores known objects and paths would streamline exploration and reduce unnecessary repeated exploration.

### Summary

Although significant improvements have been made, the current method has not yet fully solved the OVMM task. Future work needs to continue in semantic segmentation, object tracking, policy optimization, and related areas to further improve the robot's performance.
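To make the object-tracker idea from the future work list concrete, the following sketch shows the simplest possible stand-in: holding the last detection for a few frames when the detector drops it. It is a hold-last-detection placeholder, not a real tracker and not something the paper describes; the class name `LastSeenTracker` and the `max_missing_frames` parameter are hypothetical.

```python
from typing import Generic, Optional, TypeVar

T = TypeVar("T")  # detection payload, e.g. a mask or a bounding box


class LastSeenTracker(Generic[T]):
    """Keep reporting the most recent detection for a few frames after it vanishes."""

    def __init__(self, max_missing_frames: int = 5) -> None:
        # Assumed tolerance: how many consecutive missed frames to bridge.
        self.max_missing_frames = max_missing_frames
        self._last: Optional[T] = None
        self._missing = 0

    def update(self, detection: Optional[T]) -> Optional[T]:
        """Feed one frame's detection (or None) and get the smoothed result."""
        if detection is not None:
            self._last = detection
            self._missing = 0
            return detection

        # Detector missed this frame: reuse the last detection while within budget.
        self._missing += 1
        if self._last is not None and self._missing <= self.max_missing_frames:
            return self._last

        # Budget exhausted: treat the object as lost.
        self._last = None
        return None
```

A tracker of this kind only bridges short gaps in a static view; a usable version would also need to account for the robot's own motion between frames.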

HomeRobot Open Vocabulary Mobile Manipulation Challenge 2023 Participant Report (Team KuzHum)
