FLIP: Flow-Centric Generative Planning for General-Purpose Manipulation Tasks

Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Zhehao Cai, Lin Shao
2024-12-11
Abstract: We aim to develop a model-based planning framework for world models that can be scaled with increasing model and data budgets for general-purpose manipulation tasks with only language and vision inputs. To this end, we present FLow-centric generative Planning (FLIP), a model-based planning algorithm on visual space that features three key modules: 1. a multi-modal flow generation model as the general-purpose action proposal module; 2. a flow-conditioned video generation model as the dynamics module; and 3. a vision-language representation learning model as the value module. Given an initial image and language instruction as the goal, FLIP can progressively search for long-horizon flow and video plans that maximize the discounted return to accomplish the task. FLIP is able to synthesize long-horizon plans across objects, robots, and tasks with image flows as the general action representation, and the dense flow information also provides rich guidance for long-horizon video generation. In addition, the synthesized flow and video plans can guide the training of low-level control policies for robot execution. Experiments on diverse benchmarks demonstrate that FLIP can improve both the success rates and quality of long-horizon video plan synthesis and has the interactive world model property, opening up wider applications for future works.
Robotics, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper aims to develop a model-based planning framework for general-purpose manipulation tasks that requires only language and vision inputs. Specifically, it proposes FLIP (Flow-Centric Generative Planning), a model-based planning algorithm that operates in visual space and has three key modules:

1. **Multi-modal flow generation model**: serves as the general-purpose action proposal module.
2. **Flow-conditioned video generation model**: serves as the dynamics module.
3. **Vision-language representation learning model**: serves as the value module.

Given an initial image and a language instruction as the goal, FLIP progressively searches for long-horizon flow and video plans that maximize the discounted return. It can synthesize long-horizon plans across objects, robots, and tasks, and the dense flow information also provides rich guidance for long-horizon video generation. In addition, the synthesized flow and video plans can guide the training of low-level control policies for robot execution.

The core problem of the paper is how to use image flow for planning in general-purpose robot manipulation tasks. Existing methods either require additional datasets or task-specific high-level action annotation to train an interactive world model, or rely on representations that cannot describe complex and subtle motions in the scene. The authors therefore propose image flow as the action representation, because it can be extracted entirely from video-only datasets and can describe finer-grained changes.

### Summary

FLIP overcomes the limitations of existing methods in the following ways:

- Using image flow as the action representation, which avoids the need for additional datasets or task-specific high-level action annotations.
- Proposing a new flow generation network, a new flow-conditioned video generation network, and a new training method for an existing vision-language representation learning network, which together form a scalable and efficient interactive world model.
- Verifying the effectiveness of FLIP on simulated and real-world tasks, demonstrating its advantages in long-horizon video generation and in guiding low-level policy training.

In summary, the main contribution of this paper is a flow-based generative planning framework (FLIP) for general-purpose robot manipulation, built from three key modules, that performs well across multiple benchmarks.
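
The planning loop described above combines an action proposal module, a dynamics module, and a value module to search for a plan that maximizes the discounted return. It can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the state is a single number standing in for an image, a "flow" is a scalar displacement, the goal is a target number standing in for a language instruction, and the search is a plain beam search chosen only because it is easy to read. The names `propose_flows`, `predict_next`, `value`, and `plan` are hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Plan:
    states: List[float]                      # predicted states (stand-ins for video frames)
    flows: List[float] = field(default_factory=list)  # chosen actions (stand-ins for image flows)
    ret: float = 0.0                         # discounted return accumulated so far


def propose_flows(state: float, num_candidates: int, rng: random.Random) -> List[float]:
    """Toy action proposal module: sample candidate flows (scalar displacements)."""
    return [rng.uniform(-1.0, 1.0) for _ in range(num_candidates)]


def predict_next(state: float, flow: float) -> float:
    """Toy dynamics module: predict the next state given the current state and a flow."""
    return state + flow


def value(state: float, goal: float) -> float:
    """Toy value module: larger when the state is closer to the goal."""
    return -abs(goal - state)


def plan(initial_state: float, goal: float, horizon: int = 8,
         num_candidates: int = 16, beam_width: int = 4,
         gamma: float = 0.95, seed: int = 0) -> Plan:
    """Beam search (an illustrative choice) over flow sequences by discounted return."""
    rng = random.Random(seed)
    beam = [Plan(states=[initial_state])]
    for t in range(horizon):
        expanded = []
        for p in beam:
            current = p.states[-1]
            for flow in propose_flows(current, num_candidates, rng):
                nxt = predict_next(current, flow)
                expanded.append(Plan(
                    states=p.states + [nxt],
                    flows=p.flows + [flow],
                    # add the value of the predicted state, discounted by gamma**t
                    ret=p.ret + (gamma ** t) * value(nxt, goal),
                ))
        # keep only the highest-return partial plans for the next step
        expanded.sort(key=lambda q: q.ret, reverse=True)
        beam = expanded[:beam_width]
    return beam[0]


if __name__ == "__main__":
    best = plan(initial_state=0.0, goal=3.0)
    print("final state:", round(best.states[-1], 3))
    print("discounted return:", round(best.ret, 3))
```

In this sketch the three functions are deliberately trivial. In a system like the one summarized here, each would be replaced by a learned model operating on images, flows, and language, while the outer loop of propose, predict, score, and prune would keep the same shape.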