Abstract: Recent advances in imitation learning for 3D robotic manipulation have shown promising results with diffusion-based policies. However, achieving human-level dexterity requires seamless integration of geometric precision and semantic understanding. We present G3Flow, a novel framework that constructs real-time semantic flow, a dynamic, object-centric 3D semantic representation by leveraging foundation models. Our approach uniquely combines 3D generative models for digital twin creation, vision foundation models for semantic feature extraction, and robust pose tracking for continuous semantic flow updates. This integration enables complete semantic understanding even under occlusions while eliminating manual annotation requirements. By incorporating semantic flow into diffusion policies, we demonstrate significant improvements in both terminal-constrained manipulation and cross-object generalization. Extensive experiments across five simulation tasks show that G3Flow consistently outperforms existing approaches, achieving up to 68.3% and 50.1% average success rates on terminal-constrained manipulation and cross-object generalization tasks respectively. Our results demonstrate the effectiveness of G3Flow in enhancing real-time dynamic semantic feature understanding for robotic manipulation policies.

What problem does this paper attempt to address?

The paper aims to combine high-precision geometric control with semantic understanding in robot manipulation, improving both manipulation performance and generalization in complex tasks. Existing geometry-based imitation learning methods capture geometric information well, but they fall short on tasks that require fine-grained spatial control and semantic understanding, especially when objects are occluded or are geometrically similar but semantically different. These methods also usually require manual annotation of reference objects, and they struggle to maintain semantic consistency during dynamic interaction.

To address these problems, the paper proposes G3Flow, a foundation-model-driven method that builds a dynamic, object-centric, complete 3D semantic representation through real-time semantic flow. G3Flow combines 3D generative models, vision foundation models, and robust pose tracking models, which removes the need for manual annotation while maintaining continuous semantic understanding throughout the manipulation process. The method improves the success rate on terminal-constrained tasks and also shows clear advantages on cross-object generalization tasks.

### Main contributions of the paper:

1. **Propose a new foundation-model-driven method** for constructing semantic flow, a dynamic and complete semantic representation. By integrating 3D generation, detection, and pose tracking models, it achieves real-time understanding and stays consistent under occlusion without manual annotation.
2. **Develop an imitation learning framework based on semantic flow** that uses the dynamic semantic representation to enhance manipulation, achieving precise terminal control and effective generalization across object variants.
3. **Verify through extensive experiments** that semantic flow significantly enhances imitation learning policies, reaching average success rates of up to 68.3% and 50.1% on terminal-constrained tasks and cross-object generalization tasks respectively, both significantly better than existing methods.

### Experimental results:

- **Terminal-constrained tasks**: On four tasks (shoe placement, double-shoe placement, tool adjustment, and bottle adjustment), G3Flow achieves a significantly higher success rate than the baseline methods on every task. In particular, on shoe placement the success rate for placing the shoe in the correct orientation increases by more than 25%, and on bottle adjustment the success rate of upright picking is on average more than 38% higher than that of the baselines.
- **Generalization performance**: On four tasks (shoe placement, double-shoe placement, diverse-bottle picking, and tool adjustment), the average success rate of G3Flow is 18.4% higher than that of the strongest baseline, demonstrating strong generalization across object categories and variants.

Through these contributions, G3Flow offers a new approach to precise control and generalization in robot manipulation and advances robot imitation learning.
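The idea of keeping semantic flow up to date through pose tracking can be illustrated with a small sketch. This is not the authors' implementation: it assumes the digital twin is stored as points in the object frame, each carrying a fixed feature vector, and that a tracker supplies a rigid pose (rotation and translation) per frame. The names `SemanticPoint`, `rot_z`, `transform`, and `update_semantic_flow` are hypothetical, and the feature values are made up for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


@dataclass(frozen=True)
class SemanticPoint:
    """One point of the object's digital twin (hypothetical structure)."""
    xyz: Vec3                    # position, object frame or world frame
    feature: Tuple[float, ...]   # semantic descriptor attached to the point


def rot_z(theta: float) -> Mat3:
    """Rotation matrix for a rotation of theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))


def transform(point: Vec3, rotation: Mat3, translation: Vec3) -> Vec3:
    """Apply the rigid transform R @ p + t to a single 3D point."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def update_semantic_flow(
    object_points: List[SemanticPoint], rotation: Mat3, translation: Vec3
) -> List[SemanticPoint]:
    """Move every twin point to the tracked pose; features are left unchanged.

    Because the full twin is transformed, points hidden from the camera
    still carry their features after the update.
    """
    return [
        SemanticPoint(transform(p.xyz, rotation, translation), p.feature)
        for p in object_points
    ]


# Assumed example: a two-point twin, rotated 90 degrees about z and lifted 0.5.
twin = [
    SemanticPoint((1.0, 0.0, 0.0), (0.9, 0.1)),
    SemanticPoint((0.0, 1.0, 0.0), (0.2, 0.8)),
]
moved = update_semantic_flow(twin, rot_z(math.pi / 2), (0.0, 0.0, 0.5))
for p in moved:
    print([round(v, 3) for v in p.xyz], p.feature)
```

In this sketch the features are computed once and only the positions change per frame, which is one simple way to read the abstract's "continuous semantic flow updates". How G3Flow actually extracts features and tracks poses is not specified in this summary.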

G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation
