Abstract: With the rapid development of multimedia technology, person verification systems have become increasingly important for security and identity verification. However, unimodal verification systems hit performance bottlenecks in complex scenarios, which motivates multimodal feature fusion. The main problem in audio-visual feature fusion is how to integrate information from the two modalities effectively so that identity verification becomes more accurate and robust. In this paper, we focus on improving multimodal person verification by combining audio and visual features. We use pretrained models to extract an embedding from each modality and then run fusion experiments on these embeddings. The baseline passes the fused feature through a fully connected (FC) layer. Building on this baseline, we propose three fusion models based on attention mechanisms: attention, gated, and inter-attention. The fusion models are trained on the VoxCeleb1 development set and tested on the evaluation sets of the VoxCeleb1, NIST SRE19, and CNC-AV datasets. On VoxCeleb1, the best system achieved an equal error rate (EER) of 0.23% and a minimum detection cost function (minDCF) of 0.011. On the NIST SRE19 evaluation set, the EER was 2.60% and the minDCF was 0.283. On the CNC-AV evaluation set, the EER was 11.30% and the minDCF was 0.443. These results indicate that the proposed fusion methods improve the performance of multimodal person verification systems.
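
To make the difference between the FC baseline and a gated fusion concrete, the following is a minimal NumPy sketch and not the authors' implementation. The abstract does not specify layer sizes, how the fused feature is formed, or the exact gating formula, so the 512-dimensional embeddings, the concatenation step, the sigmoid gate, the random weights, and all function names here are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def fc(x, w, b):
    """Fully connected layer: affine map x @ w + b."""
    return x @ w + b


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def baseline_fusion(audio, visual, w, b):
    """Assumed baseline: concatenate both embeddings, then one FC layer."""
    joint = np.concatenate([audio, visual], axis=-1)
    return fc(joint, w, b)


def gated_fusion(audio, visual, w_gate, b_gate):
    """Assumed gated variant: an element-wise gate in (0, 1) decides how
    much of each modality to keep. Requires equal embedding sizes."""
    joint = np.concatenate([audio, visual], axis=-1)
    gate = sigmoid(fc(joint, w_gate, b_gate))
    return gate * audio + (1.0 - gate) * visual


# Hypothetical sizes: a batch of 4 trials, 512-dim embeddings per modality.
batch, dim = 4, 512
audio = rng.standard_normal((batch, dim))   # stand-in for speaker embeddings
visual = rng.standard_normal((batch, dim))  # stand-in for face embeddings

# Untrained random weights; in practice these would be learned.
w = rng.standard_normal((2 * dim, dim)) * 0.01
b = np.zeros(dim)

fused_base = baseline_fusion(audio, visual, w, b)
fused_gate = gated_fusion(audio, visual, w, b)
```

In this sketch the gated output is a per-dimension convex combination of the two embeddings, so it can lean on whichever modality is more reliable for a given trial, whereas the baseline applies the same learned projection to every input.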

Fusing audio and visual features of speech

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Multi-level Fusion of Audio and Visual Features for Speaker Identification

Audio-Visual Fusion Based on Interactive Attention for Person Verification

Information Fusion in Attention Networks Using Adaptive and Multi-level Factorized Bilinear Pooling for Audio-visual Emotion Recognition

Audio-Visual Speech Recognition Using A Two-Step Feature Fusion Strategy

Deep Multimodal Learning for Audio-Visual Speech Recognition

Integration of Multimodal Features for Video Scene Classification Based on HMM

End-to-End Audiovisual Fusion with LSTMs

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

Combining Information from Multi-Stream Features Using Deep Neural Network in Speech Recognition

Robust Audio-Visual Speech Recognition Based on Hybrid Fusion

A Hidden Markov model for Bayesian data fusion of multivariate signals

Audio-Visual Decision Fusion for WFST-based and seq2seq Models

Fusion of deep shallow features and models for speaker recognition

Attentive Fusion Enhanced Audio-Visual Encoding for Transformer Based Robust Speech Recognition

Bimodal speaker identification using dynamic bayesian network

Integrating both Visual and Audio Cues for Enhanced Video Caption

A Fusion Approach to Spoken Language Identification Based on Combining Multiple Phone Recognizers and Speech Attribute Detectors

AMFFCN: Attentional Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement

Online Early-Late Fusion Based on Adaptive HMM for Sign Language Recognition