Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations

Koffivi Fidèle Gbagbe, Miguel Altamirano Cabrera, Ali Alabbas, Oussama Alyunes, Artem Lykov, Dzmitry Tsetserukou
2024-08-19
Abstract: This research introduces the Bi-VLA (Vision-Language-Action) model, a novel system designed for bimanual robotic dexterous manipulation that seamlessly integrates vision for scene understanding, language comprehension for translating human instructions into executable code, and physical action generation. We evaluated the system's functionality through a series of household tasks, including the preparation of a desired salad upon human request. Bi-VLA demonstrates the ability to interpret complex human instructions, perceive and understand the visual context of ingredients, and execute precise bimanual actions to prepare the requested salad. We assessed the system's performance in terms of accuracy, efficiency, and adaptability to different salad recipes and human preferences through a series of experiments. Our results show a 100% success rate in generating the correct executable code by the Language Module, a 96.06% success rate in detecting specific ingredients by the Vision Module, and an overall success rate of 83.4% in correctly executing user-requested tasks.
Robotics
What problem does this paper attempt to address?
The paper aims to develop a robotic system capable of complex bimanual dexterous manipulation: one that can understand human instructions, perceive its visual environment, and perform precise two-armed actions. To this end it introduces Bi-VLA (Vision-Language-Action), a model that integrates visual scene understanding, translation of language instructions into executable code, and physical action generation. The system is evaluated on a series of household tasks, such as preparing a specific salad on request, with a focus on accuracy, efficiency, and adaptability to different salad recipes and personal preferences.

The main contributions of the paper are as follows:

1. **Multimodal Fusion**: Bi-VLA combines three modules (vision, language, and action) into a single pipeline that goes from a natural-language instruction to precise robot actions.
2. **High Performance**: In the reported experiments, the Language Module generated the correct executable code with a 100% success rate, the Vision Module detected specific ingredients with a 96.06% success rate, and the system correctly executed user-requested tasks with an overall success rate of 83.4%.
3. **Practical Application Potential**: The cooking experiments demonstrate the system's potential in real-life settings, especially home automation and assisted living.

In conclusion, the paper uses the Bi-VLA model to address the challenges bimanual robots currently face in understanding and executing complex tasks, and to advance human-machine interaction technology.