HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects

Xintao Lv, Liang Xu, Yichao Yan, Xin Jin, Congsheng Xu, Shuwen Wu, Yifan Liu, Lincheng Li, Mengxiao Bi, Wenjun Zeng, Xiaokang Yang
2024-09-11
Abstract: Generating human-object interactions (HOIs) is critical with the tremendous advances of digital avatars. Existing datasets are typically limited to humans interacting with a single object while neglecting the ubiquitous manipulation of multiple objects. Thus, we propose HIMO, a large-scale MoCap dataset of full-body human interacting with multiple objects, containing 3.3K 4D HOI sequences and 4.08M 3D HOI frames. We also annotate HIMO with detailed textual descriptions and temporal segments, benchmarking two novel tasks of HOI synthesis conditioned on either the whole text prompt or the segmented text prompts as fine-grained timeline control. To address these novel tasks, we propose a dual-branch conditional diffusion model with a mutual interaction module for HOI synthesis. Besides, an auto-regressive generation pipeline is also designed to obtain smooth transitions between HOI segments. Experimental results demonstrate the generalization ability to unseen object geometries and temporal compositions.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the limitations of existing datasets and models for generating human-object interactions (HOIs), in particular their neglect of the complexity and diversity of human interactions with multiple objects. Specifically:

1. **Limitations of single-object interactions**: Most existing datasets and models cover only a human interacting with a single object, ignoring the multi-object scenarios that are common in daily life. As a result, multi-object interaction synthesis remains under-explored.
2. **Lack of detailed temporal segmentation and text descriptions**: Existing datasets usually provide neither detailed text descriptions nor temporal segmentation annotations, which makes fine-grained timeline control and multi-step HOI synthesis difficult.

To solve these problems, the authors propose HIMO (Human Interacting with Multiple Objects), a large-scale 4D HOI dataset containing 3,376 full-body interaction sequences and 4,080,000 3D HOI frames. Its main characteristics are:

- **Multi-object interactions**: In each sequence, a human interacts with multiple everyday objects.
- **Detailed text descriptions and temporal segmentation**: Each long interaction sequence is divided into multiple temporal segments, each accompanied by a detailed text description, which supports the generation of more complex interactions.

The paper also proposes two new tasks:

1. **Text-based HOI synthesis (HIMO-Gen)**: Generate human-multi-object interaction motion from the whole text prompt.
2. **Segmented-text-based, timeline-controlled HOI synthesis (HIMO-SegGen)**: Generate multi-step interaction motion with smooth transitions from segmented text prompts.

Together, the HIMO dataset and the proposed model aim to advance research on multi-object interaction synthesis and to improve the realism and coherence of the generated results.
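To make the difference between the two conditioning modes concrete, here is a minimal Python sketch. It is not the dataset's actual file format or the authors' code: the `Segment` and `AnnotatedSequence` schemas, the helper names `whole_prompt` and `segmented_prompts`, and the example text and frame numbers are all hypothetical, chosen only to illustrate a whole-sequence prompt versus per-segment prompts with time spans.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One temporal segment of a sequence (hypothetical schema)."""
    start_frame: int  # inclusive
    end_frame: int    # exclusive
    text: str         # description of this interaction step


@dataclass
class AnnotatedSequence:
    """A multi-object interaction sequence with text annotations (hypothetical schema)."""
    num_frames: int
    object_names: List[str]
    full_text: str
    segments: List[Segment]


def whole_prompt(seq: AnnotatedSequence) -> str:
    """Whole-text conditioning: a single prompt for the entire sequence."""
    return seq.full_text


def segmented_prompts(seq: AnnotatedSequence) -> List[Tuple[int, int, str]]:
    """Timeline conditioning: one (start, end, text) triple per segment, in temporal order."""
    ordered = sorted(seq.segments, key=lambda s: s.start_frame)
    # Reject overlapping segments so the timeline is unambiguous.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.start_frame < prev.end_frame:
            raise ValueError("segments overlap")
    return [(s.start_frame, s.end_frame, s.text) for s in ordered]


# Made-up example annotation, for illustration only.
example = AnnotatedSequence(
    num_frames=300,
    object_names=["teapot", "cup"],
    full_text="Pick up the teapot and pour tea into the cup.",
    segments=[
        Segment(120, 300, "Pour tea into the cup."),
        Segment(0, 120, "Pick up the teapot."),
    ],
)

print(whole_prompt(example))
print(segmented_prompts(example))
```

In this sketch, a whole-text generator would consume only `whole_prompt(example)`, while a timeline-controlled generator would consume the ordered triples and generate one segment at a time.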