Innovative Integration of Visual Foundation Model with a Robotic Arm on a Mobile Platform

Shimian Zhang, Qiuhong Lu
2024-04-29
Abstract: In the rapidly advancing field of robotics, the fusion of state-of-the-art visual technologies with mobile robotic arms has emerged as a critical integration. This paper introduces a novel system that combines the Segment Anything model (SAM) -- a transformer-based visual foundation model -- with a robotic arm on a mobile platform. Mounting a depth camera on the robotic arm's end-effector ensures continuous object tracking, significantly mitigating environmental uncertainties. By deploying on a mobile platform, our grasping system gains enhanced mobility, which plays a key role in dynamic environments where adaptability is critical. This synthesis enables dynamic object segmentation, tracking, and grasping. It also elevates user interaction beyond that of traditional robotic systems, allowing the robot to respond intuitively to various modalities such as clicks, drawings, or voice commands. Empirical assessments in both simulated and real-world environments demonstrate the system's capabilities. This configuration opens avenues for wide-ranging applications, from industrial settings, agriculture, and household tasks to specialized assignments and beyond.
Robotics
What problem does this paper attempt to address?
This paper proposes a solution to the challenges faced by traditional vision integration on mobile manipulators. Existing systems perform poorly on unknown objects and respond only to a limited extent to natural language instructions. To address this, the paper introduces a system that combines the Segment Anything Model (SAM), a transformer-based visual foundation model, with a robotic arm on a mobile platform. The key design choice is mounting a depth camera on the arm's end-effector, which enables continuous object tracking and reduces environmental uncertainty. The mobile platform gives the system the maneuverability needed for grasping tasks in dynamic environments. The system's features include:

1. **General object recognition**: It can recognize a wide variety of objects without frequent retraining, reducing costs.
2. **Enhanced human interaction**: Users can interact with the robot intuitively by clicking, drawing, or giving voice commands.
3. **"Eye-in-hand" system**: The camera mounted on the end-effector enables precise closed-loop control, ensuring continuous localization and optimal grasping in real time.
4. **Integration on a mobile platform**: It expands the operational range, allowing the robot to navigate and perform strategic grasping in complex environments.

The paper also compares the system with existing work, such as earlier uses of SAM for robot grasping, emphasizing the advantages of the "eye-in-hand" configuration and the mobile platform. The experimental section presents results from simulations and real-world tests, demonstrating the system's effectiveness and its broad potential applications, including industrial manufacturing, consumer environments, and specialized scenarios.
Future work will focus on further optimizing the grasping algorithm by using the detailed contour information provided by the visual foundation model, with the aim of improving grasping accuracy and reducing potential damage to sensitive items. The authors also aim to reduce dependence on GPUs and achieve real-time segmentation.
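To make the "eye-in-hand" closed loop more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a binary segmentation mask (which in the described system would come from SAM), a depth image in meters aligned with that mask, and known pinhole intrinsics `fx`, `fy`, `cx`, `cy`. The function names `mask_to_camera_point` and `servo_step`, the standoff distance, and the proportional gain are all hypothetical choices for illustration; the control law is a generic proportional controller, swapped in because the summary does not specify one.

```python
import numpy as np


def mask_to_camera_point(mask, depth_m, fx, fy, cx, cy):
    """Back-project the centroid of a binary mask into the camera frame.

    Uses the standard pinhole model. Returns a 3-vector [x, y, z] in meters,
    or None if the mask contains no pixel with valid (positive) depth.
    """
    valid = mask.astype(bool) & (depth_m > 0)  # drop pixels with missing depth
    if not valid.any():
        return None
    rows, cols = np.nonzero(valid)
    u, v = cols.mean(), rows.mean()            # pixel centroid of the object
    z = float(np.median(depth_m[valid]))       # median is robust to depth outliers
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])


def servo_step(target_cam, standoff_m=0.10, gain=0.5):
    """One proportional control step, expressed in the camera frame.

    Returns the translation that moves the camera toward a pose where the
    target sits on the optical axis at `standoff_m` in front of the lens.
    (Assumed control law for illustration only.)
    """
    desired = np.array([0.0, 0.0, standoff_m])
    return gain * (target_cam - desired)


if __name__ == "__main__":
    # Synthetic stand-ins for a segmentation mask and an aligned depth image.
    mask = np.zeros((480, 640), dtype=bool)
    mask[200:241, 300:341] = True
    depth = np.full((480, 640), 0.6)

    target = mask_to_camera_point(mask, depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
    print("initial target in camera frame:", target)

    # Closed loop: in a real system each iteration would re-segment and
    # re-measure depth; here the camera motion is simulated directly
    # (pure translation, static object).
    for _ in range(20):
        target = target - servo_step(target)
    print("target after 20 steps:", target)
```

Because the camera rides on the end-effector, every iteration can re-measure the target from the current viewpoint, so errors in the initial estimate shrink as the arm approaches instead of accumulating. A real controller would additionally need the camera-to-gripper transform and orientation control, both omitted here.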