Abstract: Creating dynamic virtual environments consisting of humans interacting with objects is a fundamental problem in computer graphics. While it is well accepted that agent interactions play an essential role in synthesizing such scenes, most existing techniques focus exclusively on static scenes, leaving the dynamic component out. In this paper, we present a generative model to synthesize plausible multi‐step dynamic human‐object interactions. Generating multi‐step interactions is challenging since the space of such interactions is exponential in the number of objects, activities, and time steps. We propose to handle this combinatorial complexity by learning a lower‐dimensional space of plausible human‐object interactions. We use action plots to represent interactions as a sequence of discrete actions along with the participating objects and their states. To build action plots, we present an automatic method that applies state‐of‐the‐art computer vision techniques to RGB videos in order to detect individual objects and their states, extract the involved hands, and recognize the actions performed. The action plots are built from videos of everyday activities and are used to train a generative model based on a Recurrent Neural Network (RNN). The network learns the causal dependencies and constraints between individual actions and can be used to generate novel and diverse multi‐step human‐object interactions. Our representation and generative model enable new capabilities in a variety of applications such as interaction prediction, animation synthesis, and motion planning for a real robotic agent.
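To make the action plot idea concrete, here is a minimal sketch of how such a representation might be stored: an ordered list of discrete steps, each naming an action, the participating objects, and the object states after the step. The class names (`ActionStep`, `ActionPlot`), the field layout, and the example labels ("grasp", "pour", "cup", "kettle") are assumptions made for illustration only; the abstract does not specify the actual data format or vocabulary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ActionStep:
    """One discrete step of an interaction (hypothetical layout)."""
    action: str                                # action label, e.g. "pour"
    objects: Tuple[str, ...]                   # participating objects
    states: Tuple[Tuple[str, str], ...] = ()   # (object, state after the step)


@dataclass
class ActionPlot:
    """An ordered sequence of action steps (hypothetical layout)."""
    steps: List[ActionStep] = field(default_factory=list)

    def append(self, step: ActionStep) -> None:
        self.steps.append(step)

    def final_states(self) -> Dict[str, str]:
        # Replay the steps in order; later states overwrite earlier ones.
        states: Dict[str, str] = {}
        for step in self.steps:
            for obj, state in step.states:
                states[obj] = state
        return states

    def tokens(self) -> List[str]:
        # Flatten each step to a single token, the kind of discrete
        # symbol a sequence model could be trained on.
        return [f"{s.action}({','.join(s.objects)})" for s in self.steps]


# Illustrative plot with made-up labels.
plot = ActionPlot()
plot.append(ActionStep("grasp", ("cup",), (("cup", "held"),)))
plot.append(ActionStep("pour", ("kettle", "cup"), (("cup", "full"),)))

print(plot.tokens())
print(plot.final_states())
```

Flattening each step into one token is only one possible encoding; a model could equally consume the action, objects, and states as separate inputs.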

GATSBI: Generative Agent-centric Spatio-temporal Object Interaction

Scene Dynamics: Counterfactual Critic Multi-Agent Training for Scene Graph Generation.

GATS: Gather-Attend-Scatter

Counterfactual Critic Multi-Agent Training for Scene Graph Generation

GATSBI: Generative Adversarial Training for Simulation-Based Inference

Fast Contextual Scene Graph Generation with Unbiased Context Augmentation

Local-Global Information Interaction Debiasing for Dynamic Scene Graph Generation

Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks

Learning a Generative Model for Multi‐Step Human‐Object Interactions from Videos

Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction

Embodied Semantic Scene Graph Generation

Modeling Dynamic Environments with Scene Graph Memory

Spatio-Temporal Graph Dual-Attention Network for Multi-Agent Prediction and Tracking

Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos

Temporally Consistent Dynamic Scene Graphs: An End-to-End Approach for Action Tracklet Generation

TSGN: Temporal Scene Graph Neural Networks with Projected Vectorized Representation for Multi-Agent Motion Prediction

Generating Multi-Agent Trajectories using Programmatic Weak Supervision

GSSTU: Generative Spatial Self-Attention Transformer Unit for Enhanced Video Prediction

Dynamic Scene Graph Generation Via Temporal Prior Inference