VLMPC: Vision-Language Model Predictive Control for Robotic Manipulation

Wentao Zhao, Jiaming Chen, Ziyu Meng, Donghui Mao, Ran Song, Wei Zhang
2024-07-13
Abstract: Although Model Predictive Control (MPC) can effectively predict the future states of a system and is thus widely used in robotic manipulation tasks, it has no capability for environmental perception, which leads to failure in some complex scenarios. To address this issue, we introduce Vision-Language Model Predictive Control (VLMPC), a robotic manipulation framework that takes advantage of the powerful perception capability of a vision-language model (VLM) and integrates it with MPC. Specifically, we propose a conditional action sampling module which takes as input a goal image or a language instruction and leverages the VLM to sample a set of candidate action sequences. Then, a lightweight action-conditioned video prediction model is designed to generate a set of future frames conditioned on the candidate action sequences. VLMPC produces the optimal action sequence with the assistance of the VLM through a hierarchical cost function that formulates both pixel-level and knowledge-level consistency between the current observation and the goal image. We demonstrate that VLMPC outperforms state-of-the-art methods on public benchmarks. More importantly, our method showcases excellent performance in various real-world robotic manipulation tasks. Code is available at https://github.com/PPjmchen/VLMPC.
Robotics
What problem does this paper attempt to address?
The paper aims to address some key issues in robotic manipulation. Specifically:

1. **Lack of environmental awareness**: Traditional Model Predictive Control (MPC), while effective in predicting the future state of the system, lacks environmental awareness in complex scenarios, leading to failures in some cases.
2. **Limitations of predefined skills**: Early works based on Large Language Models (LLMs) can decompose natural language instructions into low-level operations, but these methods heavily rely on predefined individual skills, limiting the flexibility of robotic planning.
3. **Lack of foresight in planning**: Existing methods still face challenges when interacting with various objects and humans in the real world, especially in long-term planning in unseen environments. For example, in the task of opening a drawer, due to the lack of future prediction, existing methods find it difficult to directly generate precise trajectories.

To address these issues, the paper proposes Vision-Language Model Predictive Control (VLMPC), a method that combines Vision-Language Models (VLMs) and Model Predictive Control. By leveraging the strong perceptual capabilities of VLMs and video prediction models, VLMPC achieves complex path planning, thereby enhancing the robot's operational capabilities in complex scenarios. VLMPC not only avoids the manual design of individual primitives but also overcomes the limitation of previous VLM-based methods that could only generate coarse trajectories without foresight.
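
The sample, predict, score, and select loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name below is hypothetical, and each learned component is replaced by a simple stand-in. Gaussian sampling around a hinted direction stands in for VLM-guided action sampling, integrating 2D displacements stands in for action-conditioned video prediction, and the two cost terms (distance to the goal, plus a hard-coded penalty for entering a region to avoid) stand in for the pixel-level and knowledge-level terms of the hierarchical cost.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]  # toy stand-in for an image observation
Plan = List[Point]           # a sequence of 2D displacement actions


def sample_candidates(rng: random.Random, hint: Point, num: int,
                      horizon: int, noise: float = 0.2) -> List[Plan]:
    """Assumed stand-in for VLM-guided sampling: Gaussian noise around a hint."""
    return [[(rng.gauss(hint[0], noise), rng.gauss(hint[1], noise))
             for _ in range(horizon)]
            for _ in range(num)]


def predict(state: Point, plan: Plan) -> List[Point]:
    """Assumed stand-in for video prediction: integrate the displacements."""
    x, y = state
    future = []
    for dx, dy in plan:
        x, y = x + dx, y + dy
        future.append((x, y))
    return future


def distance(a: Point, b: Point) -> float:
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def hierarchical_cost(future: List[Point], goal: Point, avoid: Point,
                      radius: float = 0.5, weight: float = 10.0) -> float:
    """Two-level toy cost; the real method scores predicted frames instead."""
    # Low-level term: how far the final predicted state is from the goal.
    low_level = distance(future[-1], goal)
    # High-level term: count predicted states inside the region to avoid.
    high_level = sum(1.0 for p in future if distance(p, avoid) < radius)
    return low_level + weight * high_level


def select_best(state: Point, goal: Point, avoid: Point,
                candidates: List[Plan]) -> Plan:
    """Return the candidate whose predicted future has the lowest cost."""
    return min(candidates,
               key=lambda plan: hierarchical_cost(predict(state, plan), goal, avoid))


def run_closed_loop(seed: int = 0, steps: int = 20) -> Point:
    """Receding-horizon loop: replan each step, execute only the first action."""
    rng = random.Random(seed)
    state, goal, avoid = (0.0, 0.0), (3.0, 3.0), (1.5, 1.5)
    for _ in range(steps):
        gap = distance(state, goal)
        scale = min(0.3, gap) / max(gap, 1e-9)
        hint = ((goal[0] - state[0]) * scale, (goal[1] - state[1]) * scale)
        candidates = sample_candidates(rng, hint, num=32, horizon=4)
        dx, dy = select_best(state, goal, avoid, candidates)[0]
        state = (state[0] + dx, state[1] + dy)
    return state
```

Only the first action of the selected sequence is executed before replanning, which is the standard receding-horizon pattern in MPC. In this toy setup, a candidate that reaches the goal by passing through the avoided region scores worse than one that takes a detour, which is the kind of preference the higher-level cost term is meant to express.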