Abstract: A fundamental objective in robot manipulation is to enable models to comprehend visual scenes and execute actions. Although existing Vision-Language-Action (VLA) models for robots can handle a range of basic tasks, they still face challenges in two areas: (1) insufficient reasoning ability to tackle complex tasks, and (2) high computational costs for VLA model fine-tuning and inference. The recently proposed state space model (SSM) known as Mamba demonstrates promising capabilities in non-trivial sequence modeling with linear inference complexity. Inspired by this, we introduce RoboMamba, an end-to-end robotic VLA model that leverages Mamba to deliver both robotic reasoning and action capabilities, while maintaining efficient fine-tuning and inference. Specifically, we first integrate the vision encoder with Mamba, aligning visual tokens with language embedding through co-training, empowering our model with visual common sense and robotic-related reasoning. To further equip RoboMamba with SE(3) pose prediction abilities, we explore an efficient fine-tuning strategy with a simple policy head. We find that once RoboMamba possesses sufficient reasoning capability, it can acquire manipulation skills with minimal fine-tuning parameters (0.1% of the model) and time. In experiments, RoboMamba demonstrates outstanding reasoning capabilities on general and robotic evaluation benchmarks. Meanwhile, our model showcases impressive pose prediction results in both simulation and real-world experiments, achieving inference speeds 3 times faster than existing VLA models. Our project web page: https://sites.google.com/view/robomamba-web
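To make the "linear inference complexity" claim concrete, here is a minimal sketch of a discrete linear state space recurrence in plain Python. It is not Mamba itself: Mamba makes its parameters input-dependent (selective), whereas this illustration assumes a fixed diagonal state matrix. The function name `ssm_scan` and the parameters `a`, `b`, `c` are hypothetical and chosen only for this example.

```python
from typing import List, Sequence


def ssm_scan(
    x: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[float]:
    """Run a diagonal linear SSM over a 1-D input sequence.

    Recurrence (elementwise over the state dimensions):
        h_t = a * h_{t-1} + b * x_t
        y_t = sum(c * h_t)
    """
    h = [0.0] * len(a)  # fixed-size hidden state, independent of sequence length
    ys: List[float] = []
    for x_t in x:
        # One step touches only the fixed-size state, so its cost does not
        # depend on how many tokens came before.
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


# Impulse input on a 2-dimensional state (illustrative values only).
ys = ssm_scan([1.0, 0.0, 0.0], a=[0.5, 0.25], b=[1.0, 1.0], c=[1.0, 2.0])
print(ys)
```

Because each step updates a state of fixed size, the total work grows linearly with sequence length, in contrast to self-attention, where each new token attends to all previous ones.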

What problem does this paper attempt to address?

This paper attempts to solve two main problems:

1. **Insufficient reasoning ability of robots in handling complex tasks**: Although existing Vision-Language-Action (VLA) models can handle a range of basic tasks, their reasoning ability remains limited on complex tasks. This makes it difficult for robots to understand complex scenes and perform the corresponding manipulation.
2. **High computational cost of VLA model fine-tuning and inference**: Existing VLA models require substantial computational resources for fine-tuning and inference, which makes real-time response difficult in practical applications.

To solve these problems, the paper introduces the RoboMamba model. By building on the Mamba language model (a state space model, SSM, with linear inference complexity), RoboMamba aims to provide efficient robot reasoning and manipulation capabilities while keeping computational cost low. Specifically, the main contributions of RoboMamba include:

- **Efficient fusion of the vision encoder and the Mamba LLM**: Through alignment training and instruction co-training in the pre-training stage, the model is equipped with visual common sense and robot-related reasoning abilities.
- **Efficient pose prediction fine-tuning strategy**: By introducing a simple policy head, pose prediction ability can be acquired by fine-tuning only a small number of parameters (0.1% of the model parameters) in little time, significantly reducing the cost of fine-tuning.
- **Strong reasoning and pose prediction performance**: Experimental results show that RoboMamba performs well on general and robotic evaluation benchmarks and achieves impressive pose prediction results in simulation and real-world experiments, with an inference speed three times faster than existing VLA models.

In summary, RoboMamba aims to develop a model that has strong reasoning ability and can complete robot manipulation tasks economically and efficiently.
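The fine-tuning claim (only about 0.1% of the parameters are trained) can be illustrated with some back-of-the-envelope arithmetic. The sketch below counts the parameters of a small MLP policy head and compares them with a frozen backbone. All sizes here are assumptions made for illustration, not figures from the paper: the backbone size `BACKBONE_PARAMS`, the layer widths in `HEAD_LAYERS`, and the 7-dimensional pose output are hypothetical.

```python
from typing import Sequence


def mlp_param_count(layer_sizes: Sequence[int]) -> int:
    """Count weights and biases of a fully connected MLP.

    layer_sizes lists the width of each layer, input first and output last.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total


def trainable_fraction(head_params: int, frozen_params: int) -> float:
    """Fraction of all parameters that are updated during fine-tuning."""
    return head_params / (head_params + frozen_params)


# Hypothetical sizes, chosen only to illustrate the order of magnitude.
BACKBONE_PARAMS = 3_000_000_000  # assumed frozen vision encoder + language model
HEAD_LAYERS = [2560, 1024, 7]    # assumed hidden width -> MLP -> 7-D pose output

head_params = mlp_param_count(HEAD_LAYERS)
fraction = trainable_fraction(head_params, BACKBONE_PARAMS)
print(f"policy head parameters: {head_params:,}")
print(f"trainable fraction: {fraction:.4%}")
```

Under these assumed sizes the head holds a few million parameters against a backbone of billions, which is the regime in which a trainable fraction on the order of 0.1% arises.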

RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation

TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models

A Survey on Vision Mamba: Models, Applications and Challenges

Visual Mamba: A Survey and New Outlooks

Vision-Language Foundation Models as Effective Robot Imitators

Vision Mamba: A Comprehensive Survey and Taxonomy

VideoMamba: State Space Model for Efficient Video Understanding

MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting

VLMPC: Vision-Language Model Predictive Control for Robotic Manipulation

RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation

Towards Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation: An Empirical Study

Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

RoboLLM: Robotic Vision Tasks Grounded on Multimodal Large Language Models

GraspMamba: A Mamba-based Language-driven Grasp Detection Framework with Hierarchical Feature Learning

A Survey on Visual Mamba

Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks

Demystify Mamba in Vision: A Linear Attention Perspective

AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation