HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers

Jianke Zhang, Yanjiang Guo, Xiaoyu Chen, Yen-Jen Wang, Yucheng Hu, Chengming Shi, Jianyu Chen
2024-10-21
Abstract: Large Vision-Language-Action (VLA) models, leveraging powerful pre-trained Vision-Language Model (VLM) backends, have shown promise in robotic control due to their impressive generalization ability. However, the success comes at a cost. Their reliance on VLM backends with billions of parameters leads to high computational costs and inference latency, limiting the testing scenarios to mainly quasi-static tasks and hindering performance in dynamic tasks requiring rapid interactions. To address these limitations, this paper proposes HiRT, a Hierarchical Robot Transformer framework that enables a flexible trade-off between frequency and performance. HiRT keeps VLMs running at low frequencies to capture temporally invariant features while enabling real-time interaction through a high-frequency vision-based policy guided by the slowly updated features. Experimental results in both simulation and real-world settings demonstrate significant improvements over baseline methods. Empirically, in static tasks, we double the control frequency and achieve comparable success rates. Additionally, on novel real-world dynamic manipulation tasks which are challenging for previous VLA models, HiRT improves the success rate from 48% to 75%.
Computer Vision and Pattern Recognition, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **how to reduce the computational cost and inference latency of large Vision-Language-Action (VLA) models in robot control while maintaining high performance**. Specifically:

1. **High computational cost and inference latency**:
   - Large VLA models rely on pre-trained Vision-Language Models (VLMs) with billions of parameters, which leads to high computational cost and inference latency.
   - Inference latency slows the robot's actions, prolongs task completion time, and degrades performance and safety in dynamic tasks (such as manipulating fast-moving objects).
2. **Control frequency limitations**:
   - Because of the heavy computational burden, existing VLA models are mostly limited to quasi-static tasks when deployed and struggle with dynamic tasks that require rapid interaction.
   - These models perform poorly in complex, changing environments, especially in tasks that require quick responses.

To address these problems, the paper proposes HiRT (Hierarchical Robot Transformer), a hierarchical Transformer framework for robot control. HiRT addresses them in the following ways:

- **Hierarchical policy**: HiRT runs the VLM at a low frequency to capture temporally invariant features, while a high-frequency vision-based policy, guided by those slowly updated features, handles real-time interaction. This design significantly improves inference speed while maintaining high performance.
- **Inspired by dual-process theory**: Drawing on the dual-process theory of human cognition, HiRT combines a "System 1" that gives fast, intuitive responses with a "System 2" that performs slow, analytical planning. "System 2" extracts high-level, slowly changing information, while "System 1" responds quickly to environmental changes through a lightweight model.
Experimental results show that in static tasks HiRT doubles the control frequency while maintaining a comparable success rate. On novel real-world dynamic manipulation tasks, which are challenging for previous VLA models, HiRT improves the success rate from 48% to 75%.
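
The slow/fast split described above can be illustrated with a minimal sketch. This is not the authors' implementation: `HierarchicalController`, `toy_slow_encoder`, `toy_fast_policy`, and the `slow_every` parameter are hypothetical names chosen for illustration, and the two models are replaced by trivial stand-in functions. The sketch also refreshes the slow features synchronously every `slow_every` steps, which is a simplification of running the two modules at different frequencies.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class HierarchicalController:
    """Toy two-rate controller: a slow encoder refreshes a cached latent,
    and a fast policy acts on every step using that cached latent."""
    slow_encoder: Callable[[float, str], float]   # stand-in for the large VLM (hypothetical)
    fast_policy: Callable[[float, float], float]  # stand-in for the lightweight policy (hypothetical)
    slow_every: int = 4                           # refresh the latent every N control steps (assumed)
    latent: Optional[float] = None
    slow_calls: int = 0
    fast_calls: int = 0

    def step(self, t: int, observation: float, instruction: str) -> float:
        # Low-frequency path: recompute the slowly changing features.
        if self.latent is None or t % self.slow_every == 0:
            self.latent = self.slow_encoder(observation, instruction)
            self.slow_calls += 1
        # High-frequency path: act on the fresh observation plus the cached latent.
        self.fast_calls += 1
        return self.fast_policy(observation, self.latent)


def toy_slow_encoder(observation: float, instruction: str) -> float:
    # Placeholder "feature": depends only on the instruction, so it changes slowly.
    return float(len(instruction))


def toy_fast_policy(observation: float, latent: float) -> float:
    # Placeholder "action": move toward a target encoded by the latent.
    return latent - observation


def run_episode(steps: int = 8, slow_every: int = 4) -> Tuple[HierarchicalController, List[float]]:
    ctrl = HierarchicalController(toy_slow_encoder, toy_fast_policy, slow_every)
    actions = [ctrl.step(t, float(t), "pick up the cup") for t in range(steps)]
    return ctrl, actions


if __name__ == "__main__":
    ctrl, actions = run_episode()
    print(f"slow calls: {ctrl.slow_calls}, fast calls: {ctrl.fast_calls}")
    print(actions)
```

In this toy setup the expensive encoder runs on only a fraction of the control steps, while the cheap policy runs on every step. That is the mechanism by which a hierarchical design can raise the control frequency without discarding the features computed by the large model.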