Abstract: Speaker verification (SV) technology has the potential to enhance personalization and security in applications such as voice assistants, forensics, and access control. However, several challenges hinder the practical deployment of SV systems, notably the limited and distorted speaker information that results from short utterances and noisy environments. These two factors often coexist in real-world situations, causing significant performance degradation. Despite the importance of these obstacles, each factor has typically been studied independently, and their co-occurrence has rarely been investigated. Here, we propose a novel SV framework, the feature aggregated extended U-Net (FA-ExU-Net), which addresses both challenges simultaneously by building on prior research on each factor. FA-ExU-Net incorporates an iterative and hierarchical feature aggregation scheme, a target task-specific feature enhancement module, and a multi-scale feature aggregator for extracting information-rich embeddings. The proposed system outperforms recent baseline models on four evaluation criteria: generalizability, short-utterance performance, capacity to handle noisy environments, and robustness to short utterances in noisy environments. We demonstrate the effectiveness of the proposed model through comparison and ablation experiments and intuitive visualizations. The proposed approach is expected to contribute to the development of more robust and accurate SV models for practical applications.
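To make the combined condition concrete, the following is a minimal sketch of how a "short utterance in a noisy environment" test sample could be simulated: a clean recording is cropped to a short duration and noise is mixed in at a chosen signal-to-noise ratio (SNR). This is an illustrative assumption, not the evaluation protocol of the paper; the function names (`crop_utterance`, `mix_at_snr`), the sample rate, the duration, and the SNR value are all hypothetical.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in samples) / len(samples)


def crop_utterance(signal, sample_rate, seconds, rng):
    """Return a random contiguous crop of `seconds` length.

    If the signal is already shorter than the requested duration,
    it is returned unchanged (as a new list).
    """
    n = int(sample_rate * seconds)
    if len(signal) <= n:
        return list(signal)
    start = rng.randrange(len(signal) - n + 1)
    return list(signal[start:start + n])


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the target SNR in dB.

    Returns the noisy signal and the scaled noise that was added.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech, p_noise = power(speech), power(noise)
    if p_noise == 0:
        raise ValueError("noise has zero power")
    # SNR_dB = 10 * log10(p_speech / (scale^2 * p_noise)); solve for scale.
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    scaled = [scale * v for v in noise]
    return [s + n for s, n in zip(speech, scaled)], scaled


if __name__ == "__main__":
    rng = random.Random(0)
    rate = 8000  # hypothetical sample rate in Hz
    # Stand-ins for real audio: a 3 s tone as "speech", Gaussian noise.
    speech = [math.sin(2 * math.pi * 220 * t / rate) for t in range(3 * rate)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(rate)]
    short = crop_utterance(speech, rate, 1.0, rng)    # short utterance
    noisy, scaled = mix_at_snr(short, noise, 5.0)     # noisy at 5 dB SNR
    print(len(noisy), 10 * math.log10(power(short) / power(scaled)))
```

In a real setup, the synthetic tone and Gaussian noise would be replaced by recorded speech and recorded noise, and the crop length and SNR would be swept over the conditions of interest.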

Deep Speaker Embedding with Multi-Part Information Aggregation in Frequency-Time Domain for ASV

VarASV: Enabling Pitch-variable Automatic Speaker Verification Via Multi-task Learning

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Deep Speaker Embedding Learning with Multi-level Pooling for Text-independent Speaker Verification

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM

Learning Deep Embedding with Acoustic and Phoneme Features for Speaker Recognition in FM Broadcasting

Deep Segment Attentive Embedding for Duration Robust Speaker Verification

Deep Speaker: an End-to-End Neural Speaker Embedding System

A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments

Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification

Adversarial Speaker Verification

Wav2sv: End-to-end Speaker Embeddings Learning from Raw Waveforms Based on Metric Learning for Speaker Verification

Multi-View Speaker Embedding Learning for Enhanced Stability and Discriminability

Y-Vector: Multiscale Waveform Encoder for Speaker Embedding

ACA-Net: Towards Lightweight Speaker Verification using Asymmetric Cross Attention

VAE-based Domain Adaptation for Speaker Verification

An Improved Deep Neural Network for Modeling Speaker Characteristics at Different Temporal Scales

Introducing Multilingual Phonetic Information to Speaker Embedding for Speaker Verification

VOT: Revolutionizing Speaker Verification with Memory and Attention Mechanisms

FA-ExU-Net: the simultaneous training of an embedding extractor and enhancement model for a speaker verification system robust to short noisy utterances