Abstract: Transformer models have demonstrated superior performance across various domains, including computer vision, natural language processing, and speech recognition. Their success can be attributed to strong parallelism and high computation speed, both of which stem largely from the attention layer. In speaker recognition, state-of-the-art results have been achieved with convolutional neural network (CNN) architectures, particularly with speaker embeddings represented by x-vectors and r-vectors. However, existing CNN-based methods tend to focus on local features while overlooking the global dependencies of voiceprint features, which results in the loss of crucial information. Moreover, noise in the audio data cannot be disregarded, as it significantly degrades the extraction of discriminative speaker embeddings. To address these challenges, we propose a novel model called the Multi-Scale Expand Convolution Transformer (MEConformer), which converts variable-length audio into a fixed low-dimensional representation. The MEConformer uses a CNN framework with expanded receptive fields to capture frame-level features effectively. Additionally, we introduce a transformer encoder that incorporates contextual dependencies, enabling the extraction of both frame-level and utterance-level feature representations. Furthermore, we present a multi-scale residual aggregation strategy that facilitates the efficient transmission of voiceprint information across the model. By combining these components, the MEConformer achieves a state-of-the-art Equal Error Rate (EER) of 3.72% on the VoxCeleb1 test set. It also achieves EERs of 5.94% and 3.72% on the VoxCeleb1-H and VoxCeleb1-E datasets, respectively. The code for the proposed MEConformer model will be made publicly available at https://codeocean.com/capsule/4563012/tree.
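The Equal Error Rate quoted in the abstract is the operating point at which the false acceptance rate (impostor trials accepted) equals the false rejection rate (genuine trials rejected). The following is a minimal sketch of how such a figure could be computed from verification trial scores. It assumes that higher scores mean "same speaker" and that each trial has a binary label. The function name `equal_error_rate` and the toy scores are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Return (EER, threshold) for a set of verification trials.

    scores: similarity score per trial (higher = more likely same speaker).
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    Hypothetical helper for illustration; a trial is accepted when
    score >= threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    target, nontarget = scores[labels], scores[~labels]

    # Every distinct score is a candidate decision threshold.
    thresholds = np.unique(scores)
    # False acceptance rate: fraction of non-target trials accepted.
    far = np.array([np.mean(nontarget >= t) for t in thresholds])
    # False rejection rate: fraction of target trials rejected.
    frr = np.array([np.mean(target < t) for t in thresholds])

    # With finitely many trials the two curves may not cross exactly,
    # so take the threshold where they are closest and average them.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

# Toy trial list (made-up scores): four target and four non-target trials.
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
eer, thr = equal_error_rate(toy_scores, toy_labels)
print(eer, thr)  # EER = 0.25 at threshold 0.6
```

On the toy data, one target trial scores below the threshold 0.6 and one non-target trial scores at or above it, so both error rates equal 1/4. Evaluation toolkits often interpolate between adjacent thresholds instead of taking the nearest point, which can give slightly different values on small trial lists.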

Introducing Multilingual Phonetic Information to Speaker Embedding for Speaker Verification

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Introducing Phonetic Information to Speaker Embedding for Speaker Verification

VarASV: Enabling Pitch-variable Automatic Speaker Verification Via Multi-task Learning

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Phonetic-aware speaker embedding for far-field speaker verification

Multilingual and Crosslingual Speech Recognition Using Phonological-Vector Based Phone Embeddings

Leveraging ASR Pretrained Conformers for Speaker Verification through Transfer Learning and Knowledge Distillation

MEConformer: Highly representative embedding extractor for speaker verification via incorporating selective convolution into deep speaker encoder

Tackling the Score Shift in Cross-Lingual Speaker Verification by Exploiting Language Information

Wav2sv: End-to-end Speaker Embeddings Learning from Raw Waveforms Based on Metric Learning for Speaker Verification

MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification

Speaker Embedding Extraction with Phonetic Information

VOT: Revolutionizing Speaker Verification with Memory and Attention Mechanisms

Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification

The SpeakIn Speaker Verification System for Far-Field Speaker Verification Challenge 2022

Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder

Dual-model self-regularization and fusion for domain adaptation of robust speaker verification

A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments

Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer