Abstract: Synthesizing high-fidelity head avatars is a central problem in computer vision and graphics. Although head avatar synthesis algorithms have advanced rapidly, even the best ones still face major obstacles in real-world scenarios. One key cause is inadequate datasets: 1) current public datasets support exploration of high-fidelity head avatars in only one or two task directions; 2) these datasets usually contain digital head assets with limited data volume and a narrow distribution over attributes. In this paper, we present RenderMe-360, a comprehensive 4D human head dataset to drive advances in head avatar research. It contains massive data assets, with 243+ million complete head frames and over 800k video sequences from 500 different identities, captured by synchronized multi-view cameras at 30 FPS. It is a large-scale digital library for head avatars with three key attributes: 1) High Fidelity: all subjects are captured by 60 synchronized, high-resolution 2K cameras covering 360 degrees. 2) High Diversity: the collected subjects vary in age, era, ethnicity, and culture, providing abundant material with distinctive styles in appearance and geometry. Moreover, each subject is asked to perform various motions, such as expressions and head rotations, which further enrich the assets. 3) Rich Annotations: we provide annotations at different granularities: camera parameters, matting, scans, 2D/3D facial landmarks, FLAME fitting, and text descriptions. Based on the dataset, we build a comprehensive benchmark for head avatar research, with 16 state-of-the-art methods evaluated on five main tasks: novel view synthesis, novel expression synthesis, hair rendering, hair editing, and talking head generation. Our experiments uncover the strengths and weaknesses of current methods. RenderMe-360 opens the door for future exploration in head avatars.

A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and Editing

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

Audio-driven Talking Face Video Generation with Natural Head Pose

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

High-Fidelity and Freely Controllable Talking Head Video Generation

Talking Faces: Audio-to-Video Face Generation

Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis

MakeItTalk: Speaker-Aware Talking-Head Animation

Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation

StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles

Audio-Driven Emotional 3D Talking-Head Generation

Towards Realistic Conversational Head Generation: A Comprehensive Framework for Lifelike Video Synthesis

Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion

DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation

StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads

Emotionally Controllable Talking Face Generation from an Arbitrary Emotional Portrait

Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

RenderMe-360: A Large Digital Asset Library and Benchmarks Towards High-fidelity Head Avatars

TalkCLIP: Talking Head Generation with Text-Guided Expressive Speaking Styles

ManiTalk: Manipulable Talking Head Generation from Single Image in the Wild