Abstract: Transformer models have demonstrated superior performance across various domains, including computer vision, natural language processing, and speech recognition. Their success can be attributed to strong parallelism and high computation speed, which rely primarily on the attention layer. In speaker recognition, state-of-the-art results have been achieved with convolutional neural network (CNN) architectures, particularly with speaker embeddings represented by x-vectors and r-vectors. However, existing CNN-based methods tend to focus on local features while overlooking the global dependencies of voiceprint features, which results in the loss of crucial information. Moreover, noise in audio data cannot be disregarded, as it significantly affects the extraction of discriminative speaker embeddings. To address these challenges, we propose a novel model called the Multi-Scale Expand Convolution Transformer (MEConformer), which converts variable-length audio into a fixed low-dimensional representation. The MEConformer uses a CNN framework with expanded receptive fields to capture frame-level features effectively. We also introduce a transformer encoder that incorporates contextual dependencies, enabling the extraction of both frame-level and discourse-level feature representations. In addition, we present a multi-scale residual aggregation strategy that facilitates the efficient transmission of voiceprint information across the model. By combining these components, the MEConformer achieves a state-of-the-art Equal Error Rate (EER) of 3.72% on the VoxCeleb1 test set. It also achieves EERs of 5.94% and 3.72% on the VoxCeleb1-H and VoxCeleb1-E datasets, respectively. The code for the proposed MEConformer model will be made publicly available at https://codeocean.com/capsule/4563012/tree.
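For readers unfamiliar with the metric, the EER reported above is the operating point at which the false-acceptance rate (impostor trials accepted) equals the false-rejection rate (target trials rejected). The following is a minimal sketch of how it can be approximated from verification scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the toy scores, and the simple threshold sweep are illustrative assumptions.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER: the point where the false-acceptance rate (FAR)
    and the false-rejection rate (FRR) are closest to each other.

    scores: similarity score per trial (higher means "same speaker").
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    """
    targets = [s for s, y in zip(scores, labels) if y == 1]
    impostors = [s for s, y in zip(scores, labels) if y == 0]
    if not targets or not impostors:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = float("inf"), None
    # Candidate thresholds: every observed score, plus one above the maximum.
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]
    for t in thresholds:
        # A trial is accepted when its score is at or above the threshold.
        far = sum(s >= t for s in impostors) / len(impostors)
        frr = sum(s < t for s in targets) / len(targets)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trial list (hypothetical scores, not taken from VoxCeleb1).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))  # 0.25
```

The sweep is quadratic in the number of trials, which is fine for illustration; evaluation on large trial lists would normally sort the scores once and accumulate the error counts in a single pass.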
