Vision-Language-Action Model and Diffusion Policy Switching Enables Dexterous Control of an Anthropomorphic Hand

Cheng Pan, Kai Junge, Josie Hughes
2024-10-18
Abstract: To advance autonomous dexterous manipulation, we propose a hybrid control method that combines the relative advantages of a fine-tuned Vision-Language-Action (VLA) model and diffusion models. The VLA model provides language-commanded high-level planning, which is highly generalizable, while the diffusion model handles low-level interactions, offering the precision and robustness required for specific objects and environments. By incorporating a switching signal into the training data, we enable event-based transitions between these two models for a pick-and-place task where the target object and placement location are commanded through language. This approach is deployed on our anthropomorphic ADAPT Hand 2, a 13-DoF robotic hand that incorporates compliance through series elastic actuation, making it resilient to interactions; this demonstrates the first use of a multi-fingered hand controlled with a VLA model. We show that this model-switching approach results in a success rate of over 80% compared to under 40% when only using a VLA model, enabled by accurate near-object arm motion by the VLA model and a multi-modal grasping motion with error recovery abilities from the diffusion model.
Robotics, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses autonomous dexterous manipulation: specifically, how to control an anthropomorphic, multi-fingered hand precisely by combining the strengths of a Vision-Language-Action (VLA) model and diffusion models. It proposes a hybrid control method with three parts:

1. **Utilizing the VLA model**: Provide language-commanded high-level planning, which generalizes well.
2. **Utilizing the diffusion model**: Handle low-level interactions, providing the precision and robustness required for specific objects and environments.
3. **Switching between the models**: By including a switching signal in the training data, enable event-based transitions between the VLA model and the diffusion model, so that the system can complete a language-commanded pick-and-place task.

The goal is to raise the success rate and precision of a multi-fingered anthropomorphic hand on this task. The reported results show that, compared with using only the VLA model, the hybrid method raises the task success rate from under 40% to over 80%. The abstract attributes this gain to accurate near-object arm motion from the VLA model and to multi-modal grasping with error recovery from the diffusion model.
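
The event-based switching described above can be illustrated with a minimal sketch. It assumes that each policy returns an action together with a boolean switch flag, and that a raised flag hands control to the other policy on the next step. The names `SwitchingController`, `toy_vla`, `toy_diffusion`, and the observation keys `near_object` and `grasped` are hypothetical placeholders; the abstract does not specify the actual model interfaces or how the switching signal is encoded.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# A policy maps an observation to (action, switch_flag).
# This interface is an assumption made for illustration only.
Policy = Callable[[dict], Tuple[Sequence[float], bool]]


@dataclass
class SwitchingController:
    """Runs one policy at a time and swaps when the active one raises its flag."""
    vla: Policy
    diffusion: Policy
    active: str = "vla"  # start with high-level, language-commanded motion

    def step(self, obs: dict) -> Tuple[str, Sequence[float]]:
        policy = self.vla if self.active == "vla" else self.diffusion
        action, switch = policy(obs)
        used = self.active
        if switch:
            # Event-based transition: the other policy takes over next step.
            self.active = "diffusion" if self.active == "vla" else "vla"
        return used, action


def toy_vla(obs: dict) -> Tuple[Sequence[float], bool]:
    # Stand-in for the fine-tuned VLA model: raises the flag once the hand
    # is near the object and nothing has been grasped yet.
    return [0.0], obs["near_object"] and not obs["grasped"]


def toy_diffusion(obs: dict) -> Tuple[Sequence[float], bool]:
    # Stand-in for the diffusion policy: raises the flag once the grasp is done.
    return [0.0], obs["grasped"]


def run_episode(controller: SwitchingController, observations: List[dict]) -> List[str]:
    """Return which policy produced the action at each step."""
    return [controller.step(obs)[0] for obs in observations]


if __name__ == "__main__":
    observations = [
        {"near_object": False, "grasped": False},  # approach
        {"near_object": True, "grasped": False},   # VLA raises switch flag
        {"near_object": True, "grasped": False},   # diffusion policy grasps
        {"near_object": True, "grasped": True},    # diffusion raises switch flag
        {"near_object": True, "grasped": True},    # VLA carries object to target
    ]
    print(run_episode(SwitchingController(toy_vla, toy_diffusion), observations))
```

In this sketch the flag is produced by the policies themselves rather than by a hand-written rule, which mirrors the idea of learning the switching signal from the training data; the toy conditions used here only stand in for whatever the trained models would predict.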