Abstract: High-dimensional data such as natural images or speech signals exhibit some form of regularity, which prevents their dimensions from varying independently. This suggests that there exists a lower-dimensional latent representation from which the high-dimensional observed data were generated. Uncovering the hidden explanatory features of complex data is the goal of representation learning, and deep latent variable generative models have emerged as promising unsupervised approaches. In particular, the variational autoencoder (VAE), which is equipped with both a generative and an inference model, allows for the analysis, transformation, and generation of various types of data. Over the past few years, the VAE has been extended to deal with data that are either multimodal or dynamical (i.e., sequential). In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audiovisual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence. The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two stages. In the first stage, a vector quantized VAE (VQ-VAE) is learned independently for each modality, without temporal modeling. The second stage consists in learning the MDVAE model on the intermediate representation of the VQ-VAEs before quantization. The disentanglement between static versus dynamical and modality-specific versus modality-common information occurs during this second training stage. Extensive experiments are conducted to investigate how audiovisual speech latent factors are encoded in the latent space of MDVAE. These experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition. The results show that MDVAE effectively combines the audio and visual information in its latent space. They also show that the learned static representation of audiovisual speech can be used for emotion recognition with few labeled data, and with better accuracy than unimodal baselines and a state-of-the-art supervised model based on an audiovisual transformer architecture.
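
The latent structure described in the abstract (one static variable per sequence, plus shared and modality-specific dynamical variables per frame) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `LatentLayout`, `sample_latents`, and `decoder_inputs`, the symbols `w`, `z_av`, `z_a`, `z_v`, and all dimensions are hypothetical. The independent standard-normal sampling is only a placeholder for whatever dynamical prior the model actually uses, and the choice of which latents feed each modality is an assumption drawn from the shared versus specific split.

```python
import random
from dataclasses import dataclass


@dataclass
class LatentLayout:
    """Hypothetical latent sizes; the abstract does not state any dimensions."""
    static_dim: int = 4   # w: constant over the whole sequence
    shared_dim: int = 3   # z_av: dynamical, shared by audio and visual
    audio_dim: int = 2    # z_a: dynamical, audio-specific
    visual_dim: int = 2   # z_v: dynamical, visual-specific


def sample_latents(layout, num_frames, seed=0):
    """Draw placeholder latents: one static vector, three per-frame sequences."""
    rng = random.Random(seed)

    def gauss(n):
        # Placeholder only: i.i.d. N(0, 1), with no temporal dependency.
        return [rng.gauss(0.0, 1.0) for _ in range(n)]

    return {
        "w": gauss(layout.static_dim),
        "z_av": [gauss(layout.shared_dim) for _ in range(num_frames)],
        "z_a": [gauss(layout.audio_dim) for _ in range(num_frames)],
        "z_v": [gauss(layout.visual_dim) for _ in range(num_frames)],
    }


def decoder_inputs(latents, t):
    """Concatenate the latents assumed to condition each modality at frame t."""
    common = latents["w"] + latents["z_av"][t]   # static + shared dynamical
    audio = common + latents["z_a"][t]           # plus audio-specific part
    visual = common + latents["z_v"][t]          # plus visual-specific part
    return audio, visual


layout = LatentLayout()
latents = sample_latents(layout, num_frames=5)
audio_in, visual_in = decoder_inputs(latents, t=2)
```

In this sketch both modalities see the same static and shared dynamical values at a given frame and differ only in their modality-specific part, which mirrors the dissociation the abstract describes.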
