Replay: Multi-modal Multi-view Acted Videos for Casual Holography

Roman Shapovalov, Yanir Kleiman, Ignacio Rocco, David Novotny, Andrea Vedaldi, Changan Chen, Filippos Kokkinos, Ben Graham, Natalia Neverova
2023-07-22
Abstract: We introduce Replay, a collection of multi-view, multi-modal videos of humans interacting socially. Each scene is filmed in high production quality, from different viewpoints with several static cameras, as well as wearable action cameras, and recorded with a large array of microphones at different positions in the room. Overall, the dataset contains over 4000 minutes of footage and over 7 million timestamped high-resolution frames annotated with camera poses and partially with foreground masks. The Replay dataset has many potential applications, such as novel-view synthesis, 3D reconstruction, novel-view acoustic synthesis, human body and face analysis, and training generative models. We provide a benchmark for training and evaluating novel-view synthesis, with two scenarios of different difficulty. Finally, we evaluate several baseline state-of-the-art methods on the new benchmark.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the lack of a high-quality, multimodal, multi-view dataset for research on novel-view synthesis, 3D reconstruction, novel-view acoustic synthesis, and related technologies in natural social interaction scenarios. The researchers created a dataset named Replay to overcome the limitations of existing datasets, particularly for long-duration, dynamic content involving complex human interactions. The Replay dataset has the following features:
1. **Multimodal and Multi-view**: Video and audio are captured with several types of cameras (static cameras and wearable action cameras) and with microphones placed at different positions in the room.
2. **High Resolution and Long Duration Recording**: Each scene is recorded for approximately 3 to 5 minutes in 4K resolution, and the total duration exceeds 4000 minutes.
3. **Rich Content and Background**: The scenes show multiple actors performing various activities (such as playing games and chatting) in indoor environments with diverse, realistic backgrounds.
4. **Detailed Annotations**: The dataset provides over 7 million timestamped high-resolution frames annotated with camera poses, with foreground segmentation masks for part of the data.
5. **Supports Multiple Tasks**: The dataset can be used to train and evaluate methods for novel-view synthesis, 3D reconstruction, novel-view acoustic synthesis, human body and face analysis, and generative modeling.
To demonstrate the potential of the Replay dataset, the paper defines two benchmark scenarios, "flyaround" and "acting," which evaluate novel-view synthesis methods at different difficulty levels. It also evaluates several existing state-of-the-art methods on the benchmark, including neural radiance field-based methods (such as NeRF and its variants).
In summary, the paper aims to advance generation and reconstruction of natural interaction content, for applications such as virtual reality and augmented reality, by providing a comprehensive and challenging dataset.