Abstract: The speaker encoder is an important front-end module that extracts discriminative speaker features for many speech applications requiring speaker information. Current speaker encoders aggregate multi-scale features from utterances using multi-branch network architectures. However, naively adding many branches through fully convolutional operations cannot efficiently improve the encoder's capability to capture multi-scale features, because the number of model parameters and the computational complexity grow rapidly. Therefore, in current network architectures, only a few branches corresponding to a limited number of temporal scales are designed for capturing speaker features. To address this problem, this paper proposes an effective temporal multi-scale (TMS) model in which multi-scale branches can be efficiently designed in a speaker encoder with a negligible increase in computational cost. The TMS model is based on a time-delay neural network (TDNN), where the network architecture is separated into channel-modeling and temporal multi-branch modeling operators. In the TMS model, adding temporal multi-scale elements in the temporal multi-branch operator only slightly increases the model's parameters, thus saving more of the computational budget to add branches with large temporal scales. After model training, we further develop a systemic re-parameterization method to convert the multi-branch network topology into a single-path-based topology to increase the inference speed. We conducted automatic speaker verification (ASV) experiments under in-domain (VoxCeleb) and out-of-domain (CNCeleb) conditions to investigate the proposed TMS model's performance. Experimental results show that the TMS-based model outperformed state-of-the-art ASV models (e.g., ECAPA-TDNN) and improved robustness. Moreover, the proposed model achieved a 29%–46% increase in inference speed compared to ECAPA-TDNN.
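
The two ideas in the abstract, separating channel modeling from temporal multi-branch modeling and folding the branches into a single path after training, can be illustrated with a small sketch. This is not the authors' implementation. It assumes that each temporal branch is a per-channel (depthwise) 1-D convolution with an odd kernel size, stride 1, zero "same" padding, and no normalization or nonlinearity inside the branch, and that the branch outputs are summed. Under those assumptions the merge follows from the linearity of convolution. All function names, kernel values, and sizes below are hypothetical, and the parameter counts are illustrative arithmetic rather than figures from the paper.

```python
def conv1d_same(signal, kernel):
    """Cross-correlate one channel with an odd-length kernel.

    The input is zero-padded so the output has the same length.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(kernel[j] * padded[t + j] for j in range(len(kernel)))
        for t in range(len(signal))
    ]


def multi_branch(signal, kernels):
    """Training-time topology (assumed): run every branch, sum the outputs."""
    outputs = [conv1d_same(signal, k) for k in kernels]
    return [sum(values) for values in zip(*outputs)]


def merge_kernels(kernels):
    """Fold all branches into one kernel.

    Each kernel is centre-aligned inside the largest kernel size and the
    weights are added, which is valid because convolution is linear.
    """
    largest = max(len(k) for k in kernels)
    merged = [0.0] * largest
    for k in kernels:
        offset = (largest - len(k)) // 2
        for j, weight in enumerate(k):
            merged[offset + j] += weight
    return merged


def param_counts(channels, kernel_sizes):
    """Weight counts (biases ignored) for two hypothetical designs.

    full:      every branch is a dense channels -> channels convolution.
    separated: one pointwise (kernel size 1) channel-mixing layer plus
               one depthwise temporal kernel per channel per branch.
    """
    full = sum(channels * channels * k for k in kernel_sizes)
    separated = channels * channels + sum(channels * k for k in kernel_sizes)
    return full, separated


# Hypothetical single-channel input and three temporal scales (1, 3, 5).
signal = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 3.0]
kernels = [[1.0], [0.5, -1.0, 0.25], [0.1, 0.2, 0.3, 0.4, 0.5]]

merged = merge_kernels(kernels)
branch_out = multi_branch(signal, kernels)
single_out = conv1d_same(signal, merged)
max_diff = max(abs(a - b) for a, b in zip(branch_out, single_out))

full_params, separated_params = param_counts(512, [1, 3, 5])
```

In this sketch `max_diff` is zero up to floating-point rounding, so the single merged kernel reproduces the summed multi-branch output while needing only one convolution at inference time. The two values returned by `param_counts` show why adding a branch is cheap once channel mixing is handled separately: each extra depthwise branch adds `channels * k` weights instead of `channels * channels * k`.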

Group Multi-Scale convolutional Network for Monaural Speech Enhancement in Time-domain

Supervised Attention Multi-Scale Temporal Convolutional Network for monaural speech enhancement

PCNN: A Lightweight Parallel Conformer Neural Network for Efficient Monaural Speech Enhancement

TMS: Temporal multi-scale in time-delay neural network for speaker verification

Inter-channel Conv-TasNet for multichannel speech enhancement

TFCN: Temporal-Frequential Convolutional Network for Single-Channel Speech Enhancement

Convolutional fusion network for monaural speech enhancement

Monaural Speech Enhancement with Complex Convolutional Block Attention Module and Joint Time Frequency Losses

Multi-stage Progressive Learning-Based Speech Enhancement Using Time–Frequency Attentive Squeezed Temporal Convolutional Networks

Speech Enhancement Using Multi-Stage Self-Attentive Temporal Convolutional Networks

Time domain speech enhancement with CNN and time-attention transformer

Multi-Scale Temporal Frequency Convolutional Network With Axial Attention for Speech Enhancement

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

ScaleFormer: Transformer-based Speech Enhancement in the Multi-Scale Time Domain

Multi-scale Feature Based Convolutional Neural Networks for Large Vocabulary Speech Recognition

A Multi-scale Subconvolutional U-Net with Time-Frequency Attention Mechanism for Single Channel Speech Enhancement

Convolutional gated recurrent unit networks based real-time monaural speech enhancement

FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement

Inplace Gated Convolutional Recurrent Neural Network For Dual-channel Speech Enhancement

CompNet: Complementary Network for Single-Channel Speech Enhancement

Efficient Monaural Speech Separation with Multiscale Time-Delay Sampling