HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects

Xintao Lv, Liang Xu, Yichao Yan, Xin Jin, Congsheng Xu, Shuwen Wu, Yifan Liu, Lincheng Li, Mengxiao Bi, Wenjun Zeng, Xiaokang Yang

2024-09-11

Abstract: Generating human-object interactions (HOIs) is critical with the tremendous advances of digital avatars. Existing datasets are typically limited to humans interacting with a single object while neglecting the ubiquitous manipulation of multiple objects. Thus, we propose HIMO, a large-scale MoCap dataset of full-body human interacting with multiple objects, containing 3.3K 4D HOI sequences and 4.08M 3D HOI frames. We also annotate HIMO with detailed textual descriptions and temporal segments, benchmarking two novel tasks of HOI synthesis conditioned on either the whole text prompt or the segmented text prompts as fine-grained timeline control. To address these novel tasks, we propose a dual-branch conditional diffusion model with a mutual interaction module for HOI synthesis. Besides, an auto-regressive generation pipeline is also designed to obtain smooth transitions between HOI segments. Experimental results demonstrate the generalization ability to unseen object geometries and temporal compositions.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses the limitations of existing datasets and models for generating human-object interactions (HOIs), in particular their neglect of the complexity and diversity of human interaction with multiple objects. Specifically:

1. **Limitations of single-object interactions**: Most existing datasets and models cover only a human interacting with a single object, ignoring the multi-object scenarios that are common in daily life. As a result, multi-object interaction synthesis remains under-explored.
2. **Lack of detailed temporal segmentation and text descriptions**: Existing datasets usually provide neither detailed text descriptions nor temporal segmentation annotations, which makes fine-grained timeline control and multi-step human-object interaction synthesis difficult.

To address these problems, the authors propose HIMO (Human Interacting with Multiple Objects), a large-scale 4D HOI dataset containing 3,376 full-body interaction sequences and 4,080,000 3D HOI frames. The characteristics of the HIMO dataset include:

- **Multi-object interactions**: In each sequence, a human interacts with multiple everyday objects.
- **Detailed text descriptions and temporal segmentation**: Each long interaction sequence is finely divided into multiple temporal segments, each accompanied by a detailed text description, which supports the generation of more complex interactions.

In addition, the paper proposes two new tasks:

1. **Text-based HOI synthesis (HIMO-Gen)**: Generate human-multiple-object interactions from the entire text prompt.
2. **Segmented-text-based, timeline-controlled HOI synthesis (HIMO-SegGen)**: Generate multi-step interactions with smooth transitions from segmented text prompts.

With the HIMO dataset and the proposed model, the authors aim to advance research on multi-object interaction synthesis and to improve the realism and coherence of the generated results.
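To make the segmented-text task more concrete, here is a minimal sketch of an auto-regressive loop over text segments, in which each segment is generated conditioned on the last few frames of the previous one. It is an illustration only, not the paper's implementation: `Segment`, `generate_sequence`, `toy_generator`, and the `overlap` parameter are all hypothetical names, and the toy generator is a stand-in for a real conditional model such as the diffusion model the abstract mentions.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = List[float]  # hypothetical: one pose/object state vector per frame


@dataclass
class Segment:
    """One temporal segment of a sequence with its text prompt (hypothetical)."""
    text: str
    num_frames: int


def generate_sequence(
    segments: List[Segment],
    generate_segment: Callable[[str, List[Frame], int], List[Frame]],
    overlap: int = 2,
) -> List[Frame]:
    """Generate segments one after another, passing the trailing frames of
    what has been generated so far as context for the next segment."""
    frames: List[Frame] = []
    for seg in segments:
        # Guard overlap == 0: frames[-0:] would return the whole list.
        context = frames[-overlap:] if overlap > 0 else []
        frames.extend(generate_segment(seg.text, context, seg.num_frames))
    return frames


def toy_generator(text: str, context: List[Frame], num_frames: int) -> List[Frame]:
    """Stand-in for a learned model: continue from the last context frame."""
    start = context[-1][0] if context else 0.0
    return [[start + 0.1 * (i + 1)] for i in range(num_frames)]


segments = [Segment("picks up the cup", 3), Segment("pours water into it", 2)]
sequence = generate_sequence(segments, toy_generator)
```

Because the second segment starts from the final frame of the first, the toy output has no jump at the segment boundary, which is the kind of smooth transition the auto-regressive pipeline is meant to provide.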

HOI-M3: Capture Multiple Humans and Objects Interaction within Contextual Environment

F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions

OOD-HOI: Text-Driven 3D Whole-Body Human-Object Interactions Generation Beyond Training Domains

CG-HOI: Contact-Guided 3D Human-Object Interaction Generation

PhysHOI: Physics-Based Imitation of Dynamic Human-Object Interaction

Open-World Human-Object Interaction Detection via Multi-modal Prompts

Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations

InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction

Object Motion Guided Human Motion Synthesis

AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation

Gaze-guided Hand-Object Interaction Synthesis: Dataset and Method

Inter-X: Towards Versatile Human-Human Interaction Analysis

HOI4D: A 4D Egocentric Dataset for Category-Level Human-Object Interaction

A Novel Multi-Stream Hand-Object Interaction Network for Assembly Action Recognition

CHAIRS: Towards Full-Body Articulated Human-Object Interaction

Interactive Humanoid: Online Full-Body Motion Reaction Synthesis with Social Affordance Canonicalization and Forecasting

InterFusion: Text-Driven Generation of 3D Human-Object Interaction

Modeling 4D Human-Object Interactions for Joint Event Segmentation, Recognition, and Object Localization

Boosting Human-Object Interaction Detection with Text-to-Image Diffusion Model

DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors