Abstract: Transformer models have demonstrated superior performance across various domains, including computer vision, natural language processing, and speech recognition. Their success can be attributed to strong parallelism and high computation speed, both of which stem primarily from the attention layer. In speaker recognition, state-of-the-art results have been achieved with convolutional neural network (CNN) architectures, particularly with speaker embeddings represented by x-vectors and r-vectors. However, existing CNN-based methods tend to focus on local features while overlooking the global dependencies of voiceprint features, which results in the loss of crucial information. Moreover, noise in audio data cannot be disregarded, as it significantly degrades the extraction of discriminative speaker embeddings. To address these challenges, we propose a novel model called the Multi-Scale Expand Convolution Transformer (MEConformer), which converts variable-length audio into a fixed low-dimensional representation. The MEConformer uses a CNN framework with expanded receptive fields to capture frame-level features effectively. We also introduce a transformer encoder that incorporates contextual dependencies, enabling the extraction of both frame-level and discourse-level feature representations. In addition, we present a multi-scale residual aggregation strategy that facilitates the efficient transmission of voiceprint information across the model. By combining these components, the MEConformer achieves a state-of-the-art Equal Error Rate (EER) of 3.72% on the VoxCeleb1 test set, and EERs of 5.94% and 3.72% on the VoxCeleb1-H and VoxCeleb-E datasets, respectively. The code for the proposed MEConformer model will be made publicly available at https://codeocean.com/capsule/4563012/tree
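
The abstract reports results as Equal Error Rate (EER), the operating point at which the false acceptance rate and the false rejection rate are equal. The following is a minimal sketch of how an EER could be computed from verification trial scores. It is not taken from the MEConformer code: the function name `equal_error_rate`, the threshold sweep over observed scores, and the convention that a trial is accepted when its score is at or above the threshold are all assumptions made for illustration.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER for a set of verification trials.

    scores: similarity score per trial (higher means "same speaker").
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    target_scores = scores[labels == 1]
    nontarget_scores = scores[labels == 0]

    # Assumption: sweep only the thresholds that occur as observed scores.
    thresholds = np.unique(scores)

    # False acceptance rate: non-target trials accepted (score >= threshold).
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # False rejection rate: target trials rejected (score < threshold).
    frr = np.array([(target_scores < t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest and
    # report their average as the EER estimate.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


if __name__ == "__main__":
    # Toy trial list, invented purely for illustration.
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
    eer, threshold = equal_error_rate(toy_scores, toy_labels)
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

Evaluation toolkits typically interpolate between adjacent thresholds on the ROC curve rather than picking the nearest observed score, so values from this sketch may differ slightly from reported figures on large trial lists.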

Local Information Modeling with Self-Attention for Speaker Verification

Improving Transformer-based Networks With Locality For Automatic Speaker Verification

Self-Attention Networks for Text-Independent Speaker Verification

Self-attention Based Speaker Recognition Using Cluster-Range Loss

VOT: Revolutionizing Speaker Verification with Memory and Attention Mechanisms

Neighborhood Attention Transformer with Progressive Channel Fusion for Speaker Verification

An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

Improving Speaker Verification with Self-Pretrained Transformer Models

Low-Rank and Locality Constrained Self-Attention for Sequence Modeling

Hierarchically Attending Time-Frequency and Channel Features for Improving Speaker Verification

Branch-Transformer: A Parallel Branch Architecture to Capture Local and Global Features for Language Identification

Transformer-based End-to-End Speech Recognition with Local Dense Synthesizer Attention

Enhancing Local Dependencies for Transformer-Based Text-to-Speech via Hybrid Lightweight Convolution

Local Slot Attention for Vision-and-Language Navigation

Attention-Guided Contrastive Masked Image Modeling for Transformer-Based Self-Supervised Learning

P-vectors: A Parallel-Coupled TDNN/Transformer Network for Speaker Verification

Local Multi-Head Channel Self-Attention for Facial Expression Recognition

MEConformer: Highly representative embedding extractor for speaker verification via incorporating selective convolution into deep speaker encoder

Local-to-Global Self-Attention in Vision Transformers

Self-Convolution for Automatic Speech Recognition

Probing self-attention in self-supervised speech models for cross-linguistic differences