Abstract: Constructing simulation scenes that are both visually and physically realistic is a problem of practical interest in domains ranging from robotics to computer vision. This problem has become even more relevant as researchers wielding large data-hungry learning methods seek new sources of training data for physical decision-making systems. However, building simulation models is often still done by hand: a graphic designer and a simulation engineer work with predefined assets to construct rich scenes with realistic dynamic and kinematic properties. While this may scale to small numbers of scenes, achieving the generalization properties required for data-driven robotic control calls for a pipeline that can synthesize large numbers of realistic scenes, complete with 'natural' kinematic and dynamic structures. To attack this problem, we develop models for inferring structure and generating simulation scenes from natural images, allowing for scalable scene generation from web-scale datasets. To train these image-to-simulation models, we show how controllable text-to-image generative models can be used to generate paired training data that allows for modeling of the inverse problem, mapping from realistic images back to complete scene models. We show how this paradigm allows us to build large datasets of scenes in simulation with semantic and physical realism. We present an integrated end-to-end pipeline that generates simulation scenes complete with articulated kinematic and dynamic structures from real-world images and use these scenes for training robotic control policies. We then robustly deploy these policies in the real world for tasks like articulated object manipulation. In doing so, our work provides both a pipeline for large-scale generation of simulation environments and an integrated system for training robust robotic control policies in the resulting environments.

EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

MoreFusion: Multi-object Reasoning for 6D Pose Estimation from Volumetric Fusion

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

EVA: An Embodied World Model for Future Video Anticipation

Active Vision for Robot Manipulators Using the Free Energy Principle

Amplifying robotics capacities with a human touch: An immersive low-latency panoramic remote system

URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images

FusionSense: Bridging Common Sense, Vision, and Touch for Robust Sparse-View Reconstruction

Spatially Visual Perception for End-to-End Robotic Learning

Efficient Bi-manipulation using RGBD Multi-model Fusion based on Attention Mechanism

Reality Fusion: Robust Real-time Immersive Mobile Robot Teleoperation with Volumetric Visual Data Fusion

Engram-Driven Videography

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

Manipulate-Anything: Automating Real-World Robots using Vision-Language Models

Configurable Embodied Data Generation for Class-Agnostic RGB-D Video Segmentation

EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI

Human-oriented Representation Learning for Robotic Manipulation

An Visual System for Humanoid Robot Mobile-Manipulation Based on Virtual and Real Video Fusion

BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving

A Flexible Framework for Virtual Omnidirectional Vision to Improve Operator Situation Awareness