Abstract: With the rapid development of multimedia technology, person verification systems have become increasingly important for security and identity verification. However, unimodal verification systems hit performance bottlenecks in complex scenarios, which motivates multimodal feature fusion. The main problem in audio-visual feature fusion is how to integrate information from different modalities effectively, so that identity verification becomes more accurate and robust. In this paper, we focus on improving multimodal person verification systems by combining audio and visual features. We use pretrained models to extract embeddings from each modality and then run fusion experiments on these embeddings. The baseline approach passes the fused feature through a fully connected (FC) layer. Building on this baseline, we propose three fusion models based on attention mechanisms: attention, gated, and inter-attention. The fusion models are trained on the VoxCeleb1 development set and tested on the evaluation sets of the VoxCeleb1, NIST SRE19, and CNC-AV datasets. On VoxCeleb1, the best system achieved an equal error rate (EER) of 0.23% and a minimum detection cost function (minDCF) of 0.011. On the NIST SRE19 evaluation set, the EER was 2.60% and the minDCF was 0.283. On the CNC-AV evaluation set, the EER was 11.30% and the minDCF was 0.443. These results indicate that the proposed fusion methods improve the performance of multimodal person verification systems.
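To make the difference between the FC baseline and a gated fusion concrete, the following is a minimal NumPy sketch. It is not the implementation used in this paper: the abstract does not give the exact formulation, so the gate below follows a common gated-fusion form (a sigmoid gate mixing two tanh projections). The embedding sizes (192 for audio, 512 for visual, 128 for the fused vector), the helper names, and the random, untrained weights are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def baseline_fusion(audio, visual, w, b):
    """Baseline: concatenate both embeddings, then apply one FC layer."""
    fused = np.concatenate([audio, visual], axis=-1)
    return fused @ w + b

def gated_fusion(audio, visual, w_a, w_v, w_g, b_g):
    """Assumed gated form: a per-dimension gate mixes the two modalities."""
    h_a = np.tanh(audio @ w_a)    # projected audio embedding
    h_v = np.tanh(visual @ w_v)   # projected visual embedding
    joint = np.concatenate([audio, visual], axis=-1)
    g = sigmoid(joint @ w_g + b_g)  # gate values in (0, 1)
    return g * h_a + (1.0 - g) * h_v

def cosine_score(x, y):
    """Verification score between two fused embeddings."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical sizes; real pretrained extractors define their own.
d_a, d_v, d_out, batch = 192, 512, 128, 4
audio = rng.standard_normal((batch, d_a))
visual = rng.standard_normal((batch, d_v))

# Random weights stand in for parameters that would be learned in training.
w = 0.05 * rng.standard_normal((d_a + d_v, d_out))
b = np.zeros(d_out)
w_a = 0.05 * rng.standard_normal((d_a, d_out))
w_v = 0.05 * rng.standard_normal((d_v, d_out))
w_g = 0.05 * rng.standard_normal((d_a + d_v, d_out))
b_g = np.zeros(d_out)

base = baseline_fusion(audio, visual, w, b)
gated = gated_fusion(audio, visual, w_a, w_v, w_g, b_g)
score = cosine_score(gated[0], gated[1])
```

In this sketch the gate lets the model lean on whichever modality is more reliable in each dimension, for example the face embedding when the audio is noisy, whereas the baseline applies the same linear map to every input.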

Bi-attention Modal Separation Network for Multimodal Video Fusion

Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis

Mutual information maximization and feature space separation and bi-bimodal modality fusion for multimodal sentiment analysis

Attention-Based Multimodal Fusion for Video Description

MSAF: Multimodal Split Attention Fusion

Deep Multimodal Data Fusion

Tri-Modalities Fusion for Multimodal Sentiment Analysis

CMCI: A Robust Multimodal Fusion Method for Spiking Neural Networks

Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations

Video Sentiment Analysis with Bimodal Information-augmented Multi-Head Attention

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

Multimodal Fusion Method Based on Self-Attention Mechanism

Optimal Multimodal Fusion for Multimedia Data Analysis

BAFN: Bi-Direction Attention Based Fusion Network for Multimodal Sentiment Analysis

NHFNET: A Non-Homogeneous Fusion Network for Multimodal Sentiment Analysis

Towards Good Practices for Multi-modal Fusion in Large-scale Video Classification

Multimodal Sentiment Analysis Based on Cross-Modal Attention and Gated Cyclic Hierarchical Fusion Networks

Multimodal emotion recognition from facial expression and speech based on feature fusion

Audio-Visual Fusion Based on Interactive Attention for Person Verification

Feature Extraction Network with Attention Mechanism for Data Enhancement and Recombination Fusion for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis Using Multi-tensor Fusion Network with Cross-modal Modeling