Abstract: Transformer models have demonstrated superior performance across various domains, including computer vision, natural language processing, and speech recognition. Their success can be attributed to strong parallelism and high computation speed, both of which stem primarily from the attention layer. In speaker recognition, state-of-the-art results have been achieved with convolutional neural network (CNN) architectures, particularly with speaker embeddings represented by x-vectors and r-vectors. However, existing CNN-based methods tend to focus on local features while overlooking the global dependence of voiceprint features, which results in the loss of crucial information. Moreover, noise in audio data cannot be disregarded, as it significantly impairs the extraction of discriminative speaker embeddings. To address these challenges, we propose a novel model called the Multi-Scale Expand Convolution Transformer (MEConformer), which converts variable-length audio into a fixed low-dimensional representation. The MEConformer uses a CNN framework with expanded receptive fields to capture frame-level features effectively. Additionally, we introduce a transformer encoder that incorporates contextual dependencies, enabling the extraction of both frame-level and discourse-level feature representations. Furthermore, we present a multi-scale residual aggregation strategy, which facilitates the efficient transmission of voiceprint information across the model. By combining these components, the MEConformer achieves a state-of-the-art Equal Error Rate (EER) of 3.72% on the VoxCeleb1 test set. It also achieves EERs of 5.94% and 3.72% on the VoxCeleb1-H and VoxCeleb1-E datasets, respectively. The code for the proposed MEConformer model will be made publicly available at https://codeocean.com/capsule/4563012/tree.
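
The results above are reported as Equal Error Rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. As a minimal sketch of how such a figure can be obtained from verification trial scores, the snippet below sweeps a decision threshold over the observed scores and returns the error rate where the two rates are closest. The function name `equal_error_rate` and the toy scores are illustrative assumptions, not part of the MEConformer code, and evaluation toolkits often interpolate the ROC curve rather than using this simple sweep.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a threshold over the observed scores.

    scores: similarity score per trial (higher means "same speaker").
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_target = int((labels == 1).sum())
    n_nontarget = int((labels == 0).sum())
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    best_gap = float("inf")
    eer = 1.0
    for threshold in np.unique(scores):  # np.unique returns sorted values
        accepted = scores >= threshold
        # False acceptance rate: non-target trials that were accepted.
        far = (accepted & (labels == 0)).sum() / n_nontarget
        # False rejection rate: target trials that were rejected.
        frr = (~accepted & (labels == 1)).sum() / n_target
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2.0
    return float(eer)


# Toy trial list (made-up scores, for illustration only).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(f"EER = {equal_error_rate(toy_scores, toy_labels):.2%}")
```

With a real trial list, the scores would typically be cosine similarities between the fixed-dimensional embeddings of the two utterances in each trial.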

Recent Developments on ESPnet Toolkit Boosted by Conformer

The 2020 ESPnet Update: New Features, Broadened Applications, Performance Improvements, and Future Plans

ESPnet-SE: End-to-End Speech Enhancement and Separation Toolkit Designed for ASR Integration

ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration

ESPnet2-TTS: Extending the Edge of TTS Research

ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit

A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks

ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models

Conformer Parrotron: a Faster and Stronger End-to-end Speech Conversion and Recognition Model for Atypical Speech

Mandarin Text-to-Speech Front-End with Lightweight Distilled Convolution Network

Nextformer: A ConvNeXt Augmented Conformer For End-To-End Speech Recognition

Towards A Unified Conformer Structure: from ASR to ASV Task

Improved Conformer-based End-to-End Speech Recognition Using Neural Architecture Search

Conformer-1: Robust ASR via Large-Scale Semisupervised Bootstrapping

Conformer-based Target-Speaker Automatic Speech Recognition for Single-Channel Audio

Efficient End-to-End Speech Recognition Using Performers in Conformers

ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech

MEConformer: Highly representative embedding extractor for speaker verification via incorporating selective convolution into deep speaker encoder

CompNet: Complementary Network for Single-Channel Speech Enhancement

Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm