Abstract: With the rapid development of multimedia technology, person verification systems have become increasingly important for security and identity verification. However, unimodal verification systems hit performance bottlenecks in complex scenarios, which motivates multimodal feature fusion. The main problem in audio-visual feature fusion is how to integrate information from the two modalities effectively so as to improve the accuracy and robustness of identity verification. This paper focuses on improving multimodal person verification by combining audio and visual features. We use pretrained models to extract an embedding from each modality and then run fusion experiments on these embeddings. The baseline passes the fused feature through a fully connected (FC) layer. Building on this baseline, we propose three fusion models based on attention mechanisms: attention, gated, and inter-attention. The fusion models are trained on the VoxCeleb1 development set and tested on the evaluation sets of the VoxCeleb1, NIST SRE19, and CNC-AV datasets. The best system achieves an equal error rate (EER) of 0.23% and a minimum detection cost function (minDCF) of 0.011 on VoxCeleb1, an EER of 2.60% and a minDCF of 0.283 on NIST SRE19, and an EER of 11.30% and a minDCF of 0.443 on CNC-AV. These results indicate that the proposed fusion methods improve the performance of multimodal person verification systems.
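The abstract reports results as equal error rates, so a small illustration of that metric may help. The sketch below is not the paper's evaluation code. It assumes each verification trial yields a single similarity score, with higher meaning "more likely the same person", and it approximates the EER by sweeping every observed score as a threshold. The function name `equal_error_rate` and the toy scores are hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    target_scores: scores of genuine (same-person) trials.
    nontarget_scores: scores of impostor (different-person) trials.
    Assumes a higher score means a more confident "same person" decision.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target score")

    best_gap, eer = float("inf"), None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: a genuine trial scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: an impostor trial scored at or above the threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores, invented purely for illustration.
targets = [0.9, 0.8, 0.7, 0.35]
nontargets = [0.1, 0.2, 0.3, 0.4]
print(f"EER = {equal_error_rate(targets, nontargets):.2%}")  # EER = 25.00%
```

In this toy case one genuine trial scores below one impostor trial, so at the threshold 0.4 both error rates equal 1/4. Evaluation toolkits typically interpolate between operating points rather than picking the nearest one, so their values can differ slightly on real data.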

Audio-Visual Multi-person Keyword Spotting Via Hybrid Fusion

Audio-visual Keyword Spotting Based on Adaptive Decision Fusion under Noisy Conditions for Human-Robot Interaction

Audio-visual Keyword Spotting for Mandarin Based on Discriminative Local Spatial-Temporal Descriptors

Audio-Visual Keyword Spotting Based on Multidimensional Convolutional Neural Network

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Specialty may be better: A decoupling multi-modal fusion network for Audio-visual event localization

Sentiment Analysis Using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities

On-device Audio-visual Multi-person Wake Word Spotting

Audio-visual Keyword Transformer for Unconstrained Sentence-level Keyword Spotting

Robust Audio-Visual Speech Recognition Based on Hybrid Fusion

A Novel Lip Descriptor for Audio-Visual Keyword Spotting Based on Adaptive Decision Fusion

Audio-Visual Speech Recognition Using A Two-Step Feature Fusion Strategy

Robust Dual-Modal Speech Keyword Spotting for XR Headsets

MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword Spotting

Multi-level Fusion of Audio and Visual Features for Speaker Identification

Robust Wake Word Spotting With Frame-Level Cross-Modal Attention Based Audio-Visual Conformer

Audio-Visual Fusion Based on Interactive Attention for Person Verification

Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention

U2-KWS: Unified Two-pass Open-vocabulary Keyword Spotting with Keyword Bias