Abstract: Large Vision-Language-Action (VLA) models, leveraging powerful pre-trained Vision-Language Model (VLM) backends, have shown promise in robotic control due to their impressive generalization ability. However, the success comes at a cost. Their reliance on VLM backends with billions of parameters leads to high computational costs and inference latency, limiting the testing scenarios to mainly quasi-static tasks and hindering performance in dynamic tasks requiring rapid interactions. To address these limitations, this paper proposes HiRT, a Hierarchical Robot Transformer framework that enables a flexible trade-off between control frequency and performance. HiRT keeps VLMs running at low frequencies to capture temporally invariant features while enabling real-time interaction through a high-frequency vision-based policy guided by the slowly updated features. Experimental results in both simulation and real-world settings demonstrate significant improvements over baseline methods. Empirically, in static tasks, we double the control frequency and achieve comparable success rates. Additionally, on novel real-world dynamic manipulation tasks, which are challenging for previous VLA models, HiRT improves the success rate from 48% to 75%.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **how to reduce the computational cost and inference latency of large Vision-Language-Action (VLA) models in robot control while maintaining high performance**. Specifically:

1. **High computational cost and inference latency**:
   - Large VLA models rely on pre-trained Vision-Language Models (VLMs) with billions of parameters, which leads to high computational cost and inference latency.
   - Inference latency slows the robot's actions, prolongs task completion time, and degrades performance and safety in dynamic tasks (such as manipulating fast-moving objects).
2. **Control frequency limitations**:
   - Because of the heavy computational burden, existing VLA models are mostly limited to quasi-static tasks when deployed and struggle with dynamic tasks that require rapid interaction.
   - These models perform poorly in complex and changeable environments, especially in tasks that require quick responses.

To solve these problems, the paper proposes HiRT (Hierarchical Robot Transformers), a hierarchical framework for robot control. HiRT addresses them in the following ways:

- **Hierarchical policy**: HiRT runs the VLM at a low frequency to capture temporally invariant features while handling real-time interaction through a high-frequency vision-based policy that is guided by those slowly updated features. This design allows the system to significantly improve inference speed while maintaining high performance.
- **Inspired by dual-process theory**: Drawing on the dual-process theory of human cognition, HiRT combines a "System 1" that gives fast, intuitive responses with a "System 2" that performs slow, analytical planning. "System 2" is responsible for extracting high-level, slowly changing information, while "System 1" responds quickly to environmental changes through a lightweight model.

Through these improvements, HiRT achieves a higher control frequency with a similar success rate in static tasks and significantly improves the success rate in dynamic tasks. Experimental results show that in static tasks, HiRT doubles the control frequency while maintaining a comparable success rate; in novel real-world dynamic manipulation tasks, the success rate rises from 48% to 75% compared with previous VLA models.
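The two-rate idea described above, a slow model refreshed occasionally and a fast policy run on every control step, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `run_dual_rate_control`, the parameter `slow_every`, and the toy encoder and policy are all hypothetical stand-ins, and the slow model is called synchronously here for simplicity, whereas a real system would likely run it concurrently with the fast loop.

```python
from typing import Callable, List, Sequence

Observation = Sequence[float]
Latent = Sequence[float]
Action = Sequence[float]


def run_dual_rate_control(
    observations: Sequence[Observation],
    slow_encoder: Callable[[Observation], Latent],
    fast_policy: Callable[[Observation, Latent], Action],
    slow_every: int,
) -> List[Action]:
    """Run the fast policy on every step; refresh the slow latent every `slow_every` steps.

    `slow_encoder` stands in for the large, slow model (hypothetical),
    `fast_policy` for the lightweight high-frequency policy (hypothetical).
    """
    if slow_every < 1:
        raise ValueError("slow_every must be at least 1")

    actions: List[Action] = []
    latent: Latent = []  # always overwritten at step 0, since 0 % slow_every == 0
    for step, obs in enumerate(observations):
        if step % slow_every == 0:
            # Expensive call: happens only once every `slow_every` steps.
            latent = slow_encoder(obs)
        # Cheap call: happens on every step, conditioned on the cached latent.
        actions.append(fast_policy(obs, latent))
    return actions


# Toy stand-ins (assumptions for illustration only).
slow_calls: List[int] = []


def toy_slow_encoder(obs: Observation) -> Latent:
    slow_calls.append(1)  # count how often the slow model is invoked
    return [sum(obs)]


def toy_fast_policy(obs: Observation, latent: Latent) -> Action:
    return [obs[0] + latent[0]]


observations = [[float(t), 1.0] for t in range(6)]
actions = run_dual_rate_control(
    observations, toy_slow_encoder, toy_fast_policy, slow_every=3
)
print(len(slow_calls), actions)
```

With six steps and `slow_every=3`, the slow encoder is invoked twice while the fast policy produces six actions. In this sketch, the control frequency is set by the cost of the fast policy rather than by the slow model, which is the trade-off the summary describes.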

HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers

RT-1: Robotics Transformer for Real-World Control at Scale

TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation

SARA-RT: Scaling up Robotics Transformers with Self-Adaptive Robust Attention

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

FLTRNN: Faithful Long-Horizon Task Planning for Robotics with Large Language Models

A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM

Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks

Vision-Language Foundation Models as Effective Robot Imitators

Interactive Visual Task Learning for Robots

Towards Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation: An Empirical Study

Research on Task Decomposition and Motion Trajectory Optimization of Robotic Arm Based on VLA Large Model

Actra: Optimized Transformer Architecture for Vision-Language-Action Models in Robot Learning

OpenVLA: An Open-Source Vision-Language-Action Model

Integrating Historical Learning and Multi-View Attention with Hierarchical Feature Fusion for Robotic Manipulation

QUAR-VLA: Vision-Language-Action Model for Quadruped Robots

RT-H: Action Hierarchies Using Language

Proactive Human-Robot Interaction using Visuo-Lingual Transformers

RIRL: A Recurrent Imitation and Reinforcement Learning Method for Long-Horizon Robotic Tasks

Enhancing the LLM-Based Robot Manipulation Through Human-Robot Collaboration

MHRC: Closed-loop Decentralized Multi-Heterogeneous Robot Collaboration with Large Language Models