FA-ExU-Net: the simultaneous training of an embedding extractor and enhancement model for a speaker verification system robust to short noisy utterances
Ju-ho Kim, Jungwoo Heo, Hyun-seo Shin, Chan-yeong Lim, Ha-Jin Yu
DOI: https://doi.org/10.1109/taslp.2024.3381005
2024-01-01
Abstract: Speaker verification (SV) technology has the potential to enhance personalization and security in various applications, such as voice assistants, forensics, and access control. However, several challenges hinder the practical application of SV systems, notably the loss and distortion of speaker information caused by short utterances and noisy environments. These two factors often coexist in real-world situations, resulting in significant performance degradation of SV systems. Despite the significance of these obstacles, each factor is typically studied independently, and their co-occurrence is rarely investigated. Here, we propose a novel SV framework, feature aggregated extended U-Net (FA-ExU-Net), which addresses both challenges simultaneously by building on the success of prior research on each factor. FA-ExU-Net incorporates an iterative and hierarchical feature aggregation scheme, a target task-specific feature enhancement module, and a multi-scale feature aggregator for extracting information-rich embeddings. The proposed system outperforms recent baseline models on four evaluation criteria: generalizability, short utterance performance, capacity to handle noisy environments, and robustness to short utterances in noisy environments. We demonstrate the effectiveness of the proposed model through comparison and ablation experiments and intuitive visualizations. The proposed approach is expected to contribute to the development of more robust and accurate SV models for practical applications.
engineering, electrical & electronic,acoustics