Abstract: Multi-speaker multimedia speaker recognition (MMSR) has recently garnered significant attention. Whereas prior research focused primarily on back-end, score-level fusion of audio and visual information, this study investigates techniques for integrating audio and visual cues at the level of the front-end representations of the speaker's voice and face. The first method uses visual information to estimate the number of speakers, which addresses the difficulty of estimating speaker counts in multi-speaker conversations, especially in noisy environments. Agglomerative hierarchical clustering is then employed for speaker diarization, which proves beneficial for MMSR. This approach is termed video aiding audio fusion (VAAF). The second method introduces a ratio factor to create a multimedia vector (M-vector) that concatenates face embeddings with x-vectors, so that the resulting vector encapsulates both audio and visual cues. The M-vector is then used for MMSR. We name this method video interacting audio fusion (VIAF). Experimental results on the NIST SRE 2019 audio-visual corpus show that the VAAF-based MMSR achieves relative reductions of 6.94% in minDCF and 8.31% in actDCF when benchmarked against zero-effort systems. The VIAF-based MMSR achieves relative reductions of 12.08% in minDCF and 12.99% in actDCF compared to systems that use face embeddings alone. When both methods are combined, minDCF and actDCF improve further, reaching 0.098 and 0.102, respectively.
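
To make the M-vector idea concrete, the following is a minimal sketch of one way a ratio factor could weight a face embedding and an x-vector before concatenation. The abstract does not state how the ratio factor is applied, so the weighting scheme here (length-normalize each embedding, scale the audio part by `ratio` and the visual part by `1 - ratio`) is an assumption, as are the names `make_m_vector`, `cosine_score`, and `ratio`. It is illustrative only and not the authors' implementation.

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit length (guarding against a zero norm)."""
    v = np.asarray(v, dtype=np.float64)
    return v / max(np.linalg.norm(v), eps)


def make_m_vector(x_vector, face_embedding, ratio=0.5):
    """Build a hypothetical M-vector from an x-vector and a face embedding.

    ASSUMPTION: `ratio` in [0, 1] is the weight given to the audio part and
    (1 - ratio) the weight given to the visual part. The source only says
    that a ratio factor is used; this particular form is a guess.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    audio = ratio * l2_normalize(x_vector)
    visual = (1.0 - ratio) * l2_normalize(face_embedding)
    # Concatenate so a single vector carries both visual and audio cues.
    return np.concatenate([visual, audio])


def cosine_score(a, b):
    """Cosine similarity between two M-vectors (a simple stand-in back end)."""
    return float(np.dot(l2_normalize(a), l2_normalize(b)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy dimensions; real x-vectors and face embeddings are much larger.
    enroll = make_m_vector(rng.normal(size=8), rng.normal(size=6), ratio=0.3)
    test = make_m_vector(rng.normal(size=8), rng.normal(size=6), ratio=0.3)
    print("M-vector dimension:", enroll.shape[0])
    print("cosine score:", cosine_score(enroll, test))
```

Under this assumed scheme, `ratio=0` reduces the M-vector to the face embedding alone and `ratio=1` to the x-vector alone, so the factor acts as a single knob for trading off the two modalities.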

A Speaker Identification System for Video Content Analysis

Design and implementation of a speaker recognition system

Emotional Speaker Identification By Humans And Machines

An Automatic Detecting System Based on Sound Detection and Image Analysis

Preliminary Study on Self-contained UBM Construction for Speaker Recognition.

Audio-Visual System for Robust Speaker Recognition.

Efficient Identification Of Speakers In News Video Based On Shot Segmentation

Speaker Identification based on LSP and Gaussian Mixture Model

Combined GMM-UBM and SVM Speaker Identification System

Integration of audio-visual information for multi-speaker multimedia speaker recognition

Audio and Video Combined for Home Video Abstraction

Token-level Speaker Change Detection Using Speaker Difference and Speech Content via Continuous Integrate-and-fire

Speaker Identification System Based on Hybrid Neural Network

Robust Text-independent Speaker Identification Based on Gaussian Speech-filter

Speaker identification by BYY automatic local factor analysis based three-level voting combination

Using Subband Mel-spectrum Centroid and Gaussian Mixture Correlation for Robust Speaker Identification

Automatic Home Video Abstraction Using Audio Contents

Look, Listen and Learn - A Multimodal LSTM for Speaker Identification

A Scheme Discriminating Between Synthetic Speech and Normal Speech

Inspection of Video Frequency Scene Based on Audio Frequency Analysis

Speaker Identification Using a Reference Speaker Model Based a Two-Layer Structure