IVS-Net: Learning Human View Synthesis from Internet Videos

Junting Dong, Qi Fang, Tianshuo Yang, Qing Shuai, Chengyu Qiao, Sida Peng
DOI: https://doi.org/10.1109/iccv51070.2023.02097
2023-01-01
Abstract: Recent advances in implicit neural representations make it possible to generate free-viewpoint videos of humans from sparse-view images. To avoid expensive per-person training, previous methods adopt a generalizable human model and demonstrate impressive results. However, these methods usually rely on limited multi-view images, typically collected in a studio, or on commercial high-quality 3D scans for training, which severely limits their ability to generalize to in-the-wild images. To solve this problem, we propose a new approach that learns a generalizable human model from a new source of data, i.e., Internet videos. These videos capture various human appearances and poses and record the performers from abundant viewpoints. To exploit the Internet data, we present a video self-supervised pipeline that enforces the local appearance consistency of each body part across different frames of the same video. Once learned, the human model enables realistic novel view synthesis from a single input image. Experiments show that our method can generate high-quality view synthesis results on in-the-wild images while training only on monocular videos.