Abstract: Voice and face are two of the most popular biometrics for person verification, typically used in speaker verification and face verification tasks, respectively. It has already been observed that simply combining the information from these two modalities can lead to a more powerful and robust person verification system. In this article, to fully explore multi-modal learning strategies for person verification, we propose three types of audio-visual deep neural network (AVN): feature-level AVN (AVN-F), embedding-level AVN (AVN-E), and embedding-level combination with joint learning AVN (AVN-J). To further enhance system robustness in real noisy conditions, where high-quality input may not be available for both modalities, we propose a data augmentation strategy for each AVN: a feature-level multi-modal data augmentation for AVN-F and an embedding-level data augmentation with novel noise distribution matching for AVN-E. For AVN-J, both the feature-level and embedding-level multi-modal data augmentation methods can be applied. All the proposed models are trained on the VoxCeleb2 dev set and evaluated on the standard VoxCeleb1 dataset. The best system achieves 0.558%, 0.441%, and 0.793% EER on the three official VoxCeleb1 trial lists, which is, to our knowledge, the best published single-system result on this corpus for person verification. To validate the robustness of the proposed approaches, a noisy evaluation set based on VoxCeleb1 is constructed, and experimental results show that the proposed system significantly improves robustness and still shows promising performance under this noisy scenario.
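
For readers unfamiliar with the metric, the equal error rate (EER) reported in the abstract is the operating point at which the false acceptance rate equals the false rejection rate over a trial list. The following is a minimal sketch of that computation, not the authors' evaluation code: the function name `compute_eer` is hypothetical, it assumes higher scores mean "same person", it ignores tied scores, and it takes the closest sampled point rather than interpolating.

```python
import numpy as np

def compute_eer(scores, labels):
    """Approximate the equal error rate (as a fraction) of verification trials.

    scores: similarity score per trial (assumption: higher = more likely same person)
    labels: 1 for a target (same-person) trial, 0 for a non-target trial
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target

    # Sort trials from highest to lowest score.
    order = np.argsort(-scores, kind="mergesort")
    sorted_labels = labels[order]

    # Operating point k accepts the k highest-scoring trials (k = 0..N).
    accepted_targets = np.concatenate(([0], np.cumsum(sorted_labels)))
    accepted_nontargets = np.concatenate(([0], np.cumsum(1 - sorted_labels)))

    far = accepted_nontargets / n_nontarget          # false acceptance rate
    frr = (n_target - accepted_targets) / n_target   # false rejection rate

    # Take the operating point where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the figures quoted above. Evaluation toolkits usually interpolate between operating points, so values from this sketch may differ slightly on small trial lists.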

Self-Adaptive Soft Voice Activity Detection using Deep Neural Networks for Robust Speaker Verification

A Novel and Efficient Voice Activity Detector Using Shape Features of Speech Wave

A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments

End-to-End Speaker-Dependent Voice Activity Detection

DNN-based Voice Activity Detection for Speaker Recognition

Improving Deep Neural Networks Based Speaker Verification Using Unlabeled Data

Voice Presentation Attack Detection Using Convolutional Neural Networks

Deep Speaker Vectors for Semi Text-independent Speaker Verification

Multi-task Joint-Learning for Robust Voice Activity Detection

A Universal VAD Based on Jointly Trained Deep Neural Networks

A Robust Text-independent Speaker Verification Method Based on Speech Separation and Deep Speaker

An Adaptive X-Vector Model for Text-Independent Speaker Verification

Denoising Deep Neural Networks Based Voice Activity Detection

A Comparative Study of Robustness of Deep Learning Approaches for VAD

Deep Learning Approaches for Voice Activity Detection

Phoneme-Aware Adaptation with Discrepancy Minimization and Dynamically-Classified Vector for Text-independent Speaker Verification

Robust Voice Activity Detection Using a Masked Auditory Encoder Based Convolutional Neural Network

VAE-based Domain Adaptation for Speaker Verification

Audio-Visual Deep Neural Network for Robust Person Verification

Personal VAD: Speaker-Conditioned Voice Activity Detection

Integrated Replay Spoofing-Aware Text-Independent Speaker Verification