Abstract: Transformer models have demonstrated superior performance across various domains, including computer vision, natural language processing, and speech recognition. Their success can be attributed to strong parallelism and high computation speed, which rely primarily on the attention layer. In speaker recognition, state-of-the-art results have been achieved with convolutional neural network (CNN) architectures, particularly with speaker embeddings such as x-vectors and r-vectors. However, existing CNN-based methods tend to focus on local features while overlooking the global dependencies of voiceprint features, which results in the loss of crucial information. Moreover, noise in audio data cannot be disregarded, as it significantly degrades the extraction of discriminative speaker embeddings. To address these challenges, we propose a novel model called the Multi-Scale Expand Convolution Transformer (MEConformer), which converts variable-length audio into a fixed low-dimensional representation. The MEConformer uses a CNN framework with expanded receptive fields to capture frame-level features effectively. We additionally introduce a transformer encoder that incorporates contextual dependencies, enabling the extraction of both frame-level and discourse-level feature representations, and we present a multi-scale residual aggregation strategy that facilitates the efficient transmission of voiceprint information across the model. By combining these components, the MEConformer achieves a state-of-the-art Equal Error Rate (EER) of 3.72% on the VoxCeleb1 test set. It also achieves EERs of 5.94% and 3.72% on the VoxCeleb1-H and VoxCeleb1-E datasets, respectively. The code for the proposed MEConformer model will be made publicly available at https://codeocean.com/capsule/4563012/tree.
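
For readers unfamiliar with the Equal Error Rate reported above, the following is a minimal sketch of how the metric is commonly computed from verification trial scores. It is not the authors' evaluation code: the function name `equal_error_rate` and the toy scores are hypothetical, and the sketch assumes that a higher score means the two utterances are more likely to come from the same speaker.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    target_scores: scores of same-speaker trials (hypothetical input).
    nontarget_scores: scores of different-speaker trials (hypothetical input).
    Returns (eer, threshold) at the point where the false acceptance rate
    and the false rejection rate are closest to each other.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")

    best = None  # (gap between FAR and FRR, EER estimate, threshold)
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        # A trial is accepted when its score is at least the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores for illustration only, not taken from the paper.
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

A lower EER is better: it is the operating point at which impostor trials are accepted as often as genuine trials are rejected. Evaluation toolkits typically interpolate between thresholds rather than picking the closest observed one, so values from this sketch can differ slightly from reported figures.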

An Effective Transformer-based Contextual Model and Temporal Gate Pooling for Speaker Identification

Robust Speaker Recognition with Transformers Using wav2vec 2.0

Improving Transformer-based Networks With Locality For Automatic Speaker Verification

Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations

Towards Effective and Compact Contextual Representation for Conformer Transducer Speech Recognition Systems

P-vectors: A Parallel-Coupled TDNN/Transformer Network for Speaker Verification

MEConformer: Highly representative embedding extractor for speaker verification via incorporating selective convolution into deep speaker encoder

VOT: Revolutionizing Speaker Verification with Memory and Attention Mechanisms

Improving Multi-Speaker Tacotron with Speaker Gating Mechanisms

An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

Neighborhood Attention Transformer with Progressive Channel Fusion for Speaker Verification

Improving Deep CNN Networks with Long Temporal Context for Text-Independent Speaker Verification

Attentive Temporal Pooling for Conformer-based Streaming Language Identification in Long-form Speech

Speaker Recognition Using Isomorphic Graph Attention Network Based Pooling on Self-Supervised Representation

TMS: Temporal multi-scale in time-delay neural network for speaker verification

Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

Investigation of Speaker-adaptation methods in Transformer based ASR

Robust Feature Extraction Using Temporal Context Averaging for Speaker Identification in Diverse Acoustic Environments

A Transformer Model for Segmentation, Classification, and Caller Identification of Marmoset Vocalization

Deep Speaker Embedding Learning with Multi-level Pooling for Text-independent Speaker Verification