Abstract: Diffusion models have revolutionized the field of talking head generation, yet still face challenges in expressiveness, controllability, and stability in long-duration generation. In this research, we propose the EmotiveTalk framework to address these issues. First, to realize better control over the generation of lip movement and facial expression, we design a Vision-guided Audio Information Decoupling (V-AID) approach that generates audio-based decoupled representations aligned with lip movements and expression. Specifically, to align the audio and facial expression representation spaces, we present a Diffusion-based Co-speech Temporal Expansion (Di-CTE) module within V-AID that generates expression-related representations under multi-source emotion condition constraints. We then propose an Emotional Talking Head Diffusion (ETHD) backbone to efficiently generate highly expressive talking head videos; it contains an Expression Decoupling Injection (EDI) module that automatically decouples the expressions from reference portraits while integrating the target expression information, achieving more expressive generation. Experimental results show that EmotiveTalk can generate expressive talking head videos while maintaining emotion controllability and stability during long-duration generation, yielding state-of-the-art performance compared to existing methods.

Toward Any-to-Any Emotion Voice Conversion using Disentangled Diffusion Framework

Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model

Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity

Mixed-EVC: Mixed Emotion Synthesis and Control in Voice Conversion

Emotional Voice Conversion with Semi-Supervised Generative Modeling

EMOCONV-DIFF: Diffusion-based Speech Emotion Conversion for Non-parallel and In-the-wild Data

One-shot Emotional Voice Conversion Based on Feature Separation

Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training

DurFlex-EVC: Duration-Flexible Emotional Voice Conversion with Parallel Generation

Decoupling Speaker-Independent Emotions for Voice Conversion Via Source-Filter Networks

Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion

Multi-Target Emotional Voice Conversion With Neural Vocoders

Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models

An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation

VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech

Learning Representations of Emotional Speech with Deep Convolutional Generative Adversarial Networks

Emotional Voice Conversion With Cycle-consistent Adversarial Network

Improving Model Stability and Training Efficiency in Fast, High Quality Expressive Voice Conversion System

EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion

Emotion Intensity and its Control for Emotional Voice Conversion

A Generation of Enhanced Data by Variational Autoencoders and Diffusion Modeling