Abstract: We introduce Replay, a collection of multi-view, multi-modal videos of humans interacting socially. Each scene is filmed in high production quality, from different viewpoints with several static cameras, as well as wearable action cameras, and recorded with a large array of microphones at different positions in the room. Overall, the dataset contains over 4000 minutes of footage and over 7 million timestamped high-resolution frames annotated with camera poses and partially with foreground masks. The Replay dataset has many potential applications, such as novel-view synthesis, 3D reconstruction, novel-view acoustic synthesis, human body and face analysis, and training generative models. We provide a benchmark for training and evaluating novel-view synthesis, with two scenarios of different difficulty. Finally, we evaluate several baseline state-of-the-art methods on the new benchmark.

What problem does this paper attempt to address?

This paper addresses the lack of a high-quality, multimodal, multi-view dataset for research on novel-view synthesis, 3D reconstruction, novel-view acoustic synthesis, and related technologies in natural social interaction scenarios. Specifically, the researchers created a dataset named Replay to overcome the limitations of existing datasets, particularly for long-duration, dynamic content involving complex human interactions. The Replay dataset has the following features:

1. **Multimodal and Multi-view**: Video and audio are captured with several types of cameras (static cameras and wearable action cameras) and with microphones placed at different positions in the room.
2. **High Resolution and Long Duration Recording**: Each scene is recorded in 4K resolution for approximately 3 to 5 minutes, and the total duration exceeds 4000 minutes.
3. **Rich Content and Background**: Multiple actors perform various activities (such as playing games and chatting) in indoor environments with diverse, realistic backgrounds.
4. **Detailed Annotations**: The dataset provides a large number of timestamped high-resolution frames annotated with camera poses, as well as foreground segmentation masks for some scenes.
5. **Supports Multiple Tasks**: The data can be used to train and evaluate methods for novel-view synthesis, 3D reconstruction, novel-view acoustic synthesis, human body and face analysis, and similar tasks.

To demonstrate the dataset's potential, the paper defines two benchmark scenarios, “flyaround” and “acting,” which evaluate novel-view synthesis methods at different difficulty levels. Several existing state-of-the-art methods, including neural radiance field-based methods (such as NeRF and its variants), are evaluated on this benchmark.

In summary, the paper aims to advance the generation and reconstruction of natural interaction content in fields such as virtual reality and augmented reality by providing Replay as a comprehensive and challenging dataset.

Replay: Multi-modal Multi-view Acted Videos for Casual Holography

Novel View Synthesis of Human Interactions from Sparse Multi-view Videos

SelfRecon: Self Reconstruction Your Digital Avatar from Monocular Video

Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras

ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning

MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild

Neural Rendering and Reenactment of Human Actor Videos

Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis

View Synthesis of Dynamic Scenes based on Deep 3D Mask Volume

PKU-DyMVHumans: A Multi-View Video Benchmark for High-Fidelity Dynamic Human Modeling

Novel View Synthesis of Humans using Differentiable Rendering

Plum-pudding gels as a platform for drug delivery: understanding the effects of the different components on the diffusion behavior of solutes

Human Pose Manipulation and Novel View Synthesis using Differentiable Rendering

HSPACE: Synthetic Parametric Humans Animated in Complex Environments

Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis

Harmony4D: A Video Dataset for In-The-Wild Close Human Interactions

Interaction Replica: Tracking Human-Object Interaction and Scene Changes From Human Motion

Panoptic Studio: A Massively Multiview System for Social Interaction Capture

Headset: Human emotion awareness under partial occlusions multimodal dataset

Video-based Characters

Novel View Synthesis of Dynamic Human with Sparse Cameras