Abstract: Transformer models have demonstrated strong performance across various domains, including computer vision, natural language processing, and speech recognition. Their success can be attributed to strong parallelism and high computation speed, both of which rely primarily on the attention layer. In speaker recognition, state-of-the-art results have been achieved with convolutional neural network (CNN) architectures, particularly with speaker embeddings represented by x-vectors and r-vectors. However, existing CNN-based methods tend to focus on local features while overlooking the global dependencies of voiceprint features, which results in the loss of crucial information. Moreover, noise in audio data cannot be disregarded, as it significantly impairs the extraction of discriminative speaker embeddings. To address these challenges, we propose a novel model called the Multi-Scale Expand Convolution Transformer (MEConformer), which converts variable-length audio into a fixed low-dimensional representation. The MEConformer uses a CNN framework with expanded receptive fields to capture frame-level features effectively. Additionally, we introduce a transformer encoder that incorporates contextual dependencies, enabling the extraction of both frame-level and discourse-level feature representations. Furthermore, we present a multi-scale residual aggregation strategy, which facilitates the efficient transmission of voiceprint information across the model. By combining these components, the MEConformer achieves a state-of-the-art Equal Error Rate (EER) of 3.72% on the VoxCeleb1 test set. It also achieves EERs of 5.94% and 3.72% on the VoxCeleb1-H and VoxCeleb1-E datasets, respectively. The code for the proposed MEConformer model will be made publicly available at https://codeocean.com/capsule/4563012/tree.
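The results above are reported as Equal Error Rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. As a minimal sketch of how this metric is commonly computed from verification trials (this is not the authors' evaluation code; the function name `equal_error_rate` and the toy scores are hypothetical and used only for illustration):

```python
def equal_error_rate(scores, labels):
    """Return (eer, threshold) for a list of verification trials.

    scores: similarity scores, where higher means "more likely the same speaker".
    labels: 1 for target (same-speaker) trials, 0 for impostor trials.
    """
    n_target = sum(labels)
    n_impostor = len(labels) - n_target
    if n_target == 0 or n_impostor == 0:
        raise ValueError("need at least one target and one impostor trial")

    best_gap, eer, best_threshold = float("inf"), None, None
    # Every observed score is a candidate threshold.
    # A trial is accepted when its score is >= the threshold.
    for threshold in sorted(set(scores)):
        false_accepts = sum(
            1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
        )
        false_rejects = sum(
            1 for s, y in zip(scores, labels) if y == 1 and s < threshold
        )
        far = false_accepts / n_impostor  # false acceptance rate
        frr = false_rejects / n_target    # false rejection rate
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if gap < best_gap:
            best_gap, eer, best_threshold = gap, (far + frr) / 2, threshold
    return eer, best_threshold


# Toy trial list (hypothetical values, not taken from VoxCeleb).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
eer, threshold = equal_error_rate(toy_scores, toy_labels)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

This brute-force sweep is quadratic in the number of trials and is meant only to make the definition concrete. Evaluations on large trial lists typically sort the scores once or derive the EER from a ROC curve.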

Conformer-1: Robust ASR via Large-Scale Semisupervised Bootstrapping

Conformer-based Target-Speaker Automatic Speech Recognition for Single-Channel Audio

Audio-Visual Efficient Conformer for Robust Speech Recognition

Nextformer: A ConvNeXt Augmented Conformer For End-To-End Speech Recognition

Continual Learning for On-Device Speech Recognition using Disentangled Conformers

Towards A Unified Conformer Structure: from ASR to ASV Task

Pretraining Conformer with ASR or ASV for Anti-Spoofing Countermeasure

Conformer Parrotron: a Faster and Stronger End-to-end Speech Conversion and Recognition Model for Atypical Speech

Research on an improved Conformer end-to-end Speech Recognition Model with R-Drop Structure

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Memory-augmented conformer for improved end-to-end long-form ASR

An Improvement to Conformer-Based Model for High-Accuracy Speech Feature Extraction and Learning

MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification

Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

Self-consistent context aware conformer transducer for speech recognition

Locality Matters: A Locality-Biased Linear Attention for Automatic Speech Recognition

Conformer with dual-mode chunked attention for joint online and offline ASR

Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems

A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks

MEConformer: Highly representative embedding extractor for speaker verification via incorporating selective convolution into deep speaker encoder

Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer