Abstract: Modeling 3D human-object interaction (HOI) is a problem of great interest for computer vision and a key enabler for virtual and mixed-reality applications. Existing methods work in a one-way direction: some recover plausible human interactions conditioned on a 3D object; others recover the object pose conditioned on a human pose. Instead, we provide the first unified model, TriDi, which works in any direction. Concretely, we generate Human, Object, and Interaction modalities simultaneously with a new three-way diffusion process, which allows us to model seven distributions with one network. We implement TriDi as a transformer attending to the various modalities' tokens, thereby discovering conditional relations between them. The user can control the interaction either as a text description of HOI or a contact map. We embed these two representations into a shared latent space, combining the practicality of text descriptions with the expressiveness of contact maps. Using a single network, TriDi unifies all the special cases of prior work and extends to new ones, modeling a family of seven distributions. Remarkably, despite using a single model, TriDi-generated samples surpass one-way specialized baselines on GRAB and BEHAVE in terms of both qualitative and quantitative metrics, while demonstrating better diversity. We show the applicability of TriDi to scene population, generating objects for human-contact datasets, and generalization to unseen object geometry. The project page is available at https://virtualhumans.mpi-inf.mpg.de/tridi.

What problem does this paper attempt to address?

The problem this paper addresses is modeling 3D human-object interaction (HOI), an important topic in computer vision and a key enabler for virtual and mixed-reality applications. Existing methods usually work in only one direction: some recover plausible human interactions given a 3D object, while others recover the object pose given a human pose. These methods are each limited to one specific conditioning setup and therefore lack flexibility and generality. The paper proposes a new joint probabilistic model, TriDi, which generates the three modalities of human, object, and interaction simultaneously and unifies them through a new three-way diffusion process. This covers use cases that previous work treated in isolation and allows sampling under seven different conditional configurations, which makes the model considerably more flexible and widely applicable.

### Main problems and solutions

1. **Limitations of existing methods**:
   - Most existing methods work in a single direction only, for example predicting the human pose from an object or predicting the object from a human pose.
   - These methods require a specialized model for each conditioning setup, which increases overall complexity and makes them hard to extend.
2. **Innovations of TriDi**:
   - **Unified joint model**: TriDi is the first joint model that works bidirectionally, or even multi-directionally, and handles multiple combinations of human, object, and interaction simultaneously.
   - **Three-way diffusion process**: By introducing a three-way diffusion process, TriDi models seven distributions with one network, which simplifies the model design and improves efficiency.
   - **Flexible conditional control**: Users can control the interaction through text descriptions or contact maps, combining the practicality of text descriptions with the expressiveness of contact maps.

### Specific contributions of the paper

- **First joint model**: TriDi is the first joint model that handles human, object, and interaction simultaneously. It covers a total of seven operation modes, so many previous works become special cases of it.
- **Novel interaction representation**: By embedding body contact maps and text descriptions into a shared latent space, it provides an intuitive and detailed way of representing interactions.
- **Open-source code**: The authors promise to release the code, providing the community with tools for tasks such as scene completion and generation from partial observations.

### Summary

The core problem of the paper is to improve the flexibility and generality of 3D human-object interaction modeling. The TriDi joint model removes the restriction of existing methods to one specific conditioning setup, and it improves both the diversity of the generated samples and the range of tasks the model can be applied to.
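The figure of seven configurations can be checked by enumeration. The sketch below assumes that each configuration is a split of the three modalities into a non-empty "generated" set and a "given" set, which yields 2^3 - 1 = 7 cases. The names `sampling_modes`, `timesteps_for`, and `describe` are hypothetical. The per-modality timestep rule (given modalities stay at noise level 0) is likewise an assumption about how a single network could serve all cases, not a description of the authors' implementation.

```python
from itertools import combinations

MODALITIES = ("human", "object", "interaction")


def sampling_modes(modalities=MODALITIES):
    """Enumerate every split of the modalities into (generated, given).

    At least one modality must be generated, so n modalities give
    2**n - 1 modes (7 for n = 3).
    """
    modes = []
    for k in range(len(modalities), 0, -1):
        for generated in combinations(modalities, k):
            given = tuple(m for m in modalities if m not in generated)
            modes.append((generated, given))
    return modes


def timesteps_for(generated, t, modalities=MODALITIES):
    """Hypothetical timestep assignment (an assumption, not the paper's code):
    generated modalities share noise level t, given ones stay clean (0)."""
    return {m: (t if m in generated else 0) for m in modalities}


def describe(generated, given):
    """Format one mode as a distribution, e.g. 'p(human | object, interaction)'."""
    lhs = ", ".join(generated)
    return f"p({lhs} | {', '.join(given)})" if given else f"p({lhs})"


if __name__ == "__main__":
    for generated, given in sampling_modes():
        print(describe(generated, given), timesteps_for(generated, t=500))
```

Under this reading, the joint distribution is the case where all three modalities are generated, and the one-way methods of prior work correspond to the cases with a single generated modality.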

TriDi: Trilateral Diffusion of 3D Humans, Objects, and Interactions
