Abstract: The speaker encoder is an important front-end module that extracts discriminative speaker features for many speech applications requiring speaker information. Current speaker encoders aggregate multi-scale features from utterances using multi-branch network architectures. However, naively adding many branches through fully convolutional operations cannot efficiently improve the encoder's capability to capture multi-scale features, because the number of model parameters and the computational complexity increase rapidly. Therefore, current network architectures include only a few branches, corresponding to a limited number of temporal scales, for capturing speaker features. To address this problem, this paper proposes an effective temporal multi-scale (TMS) model in which multi-scale branches can be efficiently designed in a speaker encoder with a negligible increase in computational cost. The TMS model is based on a time-delay neural network (TDNN), where the network architecture is separated into channel-modeling and temporal multi-branch modeling operators. In the TMS model, adding temporal multi-scale elements to the temporal multi-branch operator only slightly increases the number of model parameters, thus saving more of the computational budget for adding branches with large temporal scales. After model training, we further develop a systematic re-parameterization method to convert the multi-branch network topology into a single-path topology to increase the inference speed. We conducted automatic speaker verification (ASV) experiments under in-domain (VoxCeleb) and out-of-domain (CNCeleb) conditions to investigate the proposed TMS model's performance. Experimental results show that the TMS-based model outperformed state-of-the-art ASV models (e.g., ECAPA-TDNN) and improved robustness. Moreover, the proposed model achieved a 29%–46% increase in inference speed compared to ECAPA-TDNN.
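The re-parameterization step mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's exact procedure. It assumes one channel, stride 1, no dilation, odd kernel lengths, zero padding, and branches that are simply summed. The function names and the branch weights are hypothetical. Under those assumptions, convolution is linear, so parallel temporal branches can be folded into one kernel by centering each smaller kernel inside the largest one and adding the weights.

```python
def conv1d_same(x, kernel):
    """Zero-padded 'same' 1-D correlation for an odd-length kernel."""
    half = len(kernel) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(x):  # samples outside the signal count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def multi_branch(x, kernels):
    """Training-time topology: run every temporal branch and sum the outputs."""
    outputs = [conv1d_same(x, k) for k in kernels]
    return [sum(vals) for vals in zip(*outputs)]


def merge_branches(kernels):
    """Fold parallel odd-length kernels into one kernel of the largest size."""
    size = max(len(k) for k in kernels)
    merged = [0.0] * size
    for kernel in kernels:
        offset = (size - len(kernel)) // 2  # centre the smaller kernel
        for i, w in enumerate(kernel):
            merged[offset + i] += w
    return merged


# Hypothetical per-channel branch weights with temporal scales 1, 3 and 5.
branches = [[0.5], [0.25, -1.0, 0.25], [0.125, 0.0, 2.0, 0.0, -0.125]]
signal = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0]

single_kernel = merge_branches(branches)        # inference-time single path
y_multi = multi_branch(signal, branches)        # three convolutions
y_single = conv1d_same(signal, single_kernel)   # one convolution
```

In this sketch the single-path output matches the multi-branch output while running one convolution instead of three, which is the general reason such a conversion can speed up inference.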

Conformer-based Target-Speaker Automatic Speech Recognition for Single-Channel Audio

Speaker-conditioning Single-channel Target Speaker Extraction using Conformer-based Architectures

Conformer-1: Robust ASR via Large-Scale Semisupervised Bootstrapping

End-to-end Multichannel Speaker-Attributed ASR: Speaker Guided Decoder and Input Feature Analysis

Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain

Target Speaker Extraction Using Attention-Enhanced Temporal Convolutional Network

Separate-to-Recognize: Joint Multi-target Speech Separation and Speech Recognition for Speaker-attributed ASR

Real-time End-to-End Monaural Multi-speaker Speech Recognition

Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment

t-SOT FNT: Streaming Multi-talker ASR with Text-only Domain Adaptation Capability

Audio-Visual Efficient Conformer for Robust Speech Recognition

META-CAT: Speaker-Informed Speech Embeddings via Meta Information Concatenation for Multi-talker ASR

End-to-end Monaural Multi-speaker ASR System Without Pretraining

RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios

Improving End-to-End Single-Channel Multi-Talker Speech Recognition

TMS: Temporal multi-scale in time-delay neural network for speaker verification

End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation

End-to-End Speech Recognition Model Based on Dilated Sparse Aware Network

Complex Neural Spatial Filter: Enhancing Multi-channel Target Speech Separation in Complex Domain

SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling

End-to-End Joint Target and Non-Target Speakers ASR