Abstract: Speaker embedding is an important front-end module for extracting discriminative speaker features in speech applications that require speaker information. Current state-of-the-art (SOTA) backbone networks for speaker embedding aggregate multi-scale features from an utterance using multi-branch network architectures. However, naively adding many multi-scale branches with simple fully convolutional operations does not improve performance efficiently, because the number of model parameters and the computational complexity increase rapidly. As a result, most current SOTA architectures can afford only a few branches, covering a limited number of temporal scales. To address this problem, we propose an effective temporal multi-scale (TMS) model in which multi-scale branches can be added to a speaker embedding network with almost no increase in computational cost. The new model is based on the conventional time-delay neural network (TDNN), whose architecture is separated into two modeling operators: a channel-modeling operator and a temporal multi-branch modeling operator. Adding temporal scales in the temporal multi-branch operator requires only a small increase in the number of parameters, which leaves more of the computational budget for additional branches with large temporal scales. Moreover, for the inference stage, we developed a systematic re-parameterization method that converts the TMS-based model into a single-path topology to increase inference speed. We evaluated the TMS method for automatic speaker verification (ASV) under in-domain and out-of-domain conditions. Results show that the TMS-based model significantly outperformed SOTA ASV models while also achieving faster inference.
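The re-parameterization idea mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the details of the authors' method, so the following rests on assumptions: each temporal branch is taken to be a per-channel (depthwise) 1-D convolution with an odd kernel size, zero "same" padding, no dilation, and the branch outputs are summed. Under those assumptions, linearity of convolution lets the branches be folded into one kernel by centering each kernel in a zero-padded kernel of the largest size and adding them. The function names `depthwise_conv1d`, `merge_branches`, and `multi_branch_forward` are hypothetical and chosen for illustration only.

```python
import random


def depthwise_conv1d(x, kernel):
    """'Same'-padded depthwise cross-correlation (assumed branch operator).

    x:      list of C channels, each a list of T floats.
    kernel: list of C per-channel kernels, each of odd length K.
    """
    out = []
    for signal, taps in zip(x, kernel):
        pad = len(taps) // 2
        padded = [0.0] * pad + list(signal) + [0.0] * pad
        out.append([
            sum(padded[t + n] * taps[n] for n in range(len(taps)))
            for t in range(len(signal))
        ])
    return out


def multi_branch_forward(x, branch_kernels):
    """Training-time view: run every branch separately and sum the outputs."""
    outputs = [depthwise_conv1d(x, k) for k in branch_kernels]
    return [
        [sum(vals) for vals in zip(*channel_outs)]
        for channel_outs in zip(*outputs)
    ]


def merge_branches(branch_kernels):
    """Inference-time view: fold all branches into one kernel.

    Each kernel is centered inside a zero kernel of the largest size,
    then the padded kernels are summed tap by tap.
    """
    k_max = max(len(kernel[0]) for kernel in branch_kernels)
    channels = len(branch_kernels[0])
    merged = [[0.0] * k_max for _ in range(channels)]
    for kernel in branch_kernels:
        offset = (k_max - len(kernel[0])) // 2
        for c in range(channels):
            for n, tap in enumerate(kernel[c]):
                merged[c][offset + n] += tap
    return merged


# Hypothetical sizes: 4 channels, 50 frames, branches with kernel sizes 1, 3, 5, 7.
random.seed(0)
channels, frames = 4, 50
x = [[random.gauss(0, 1) for _ in range(frames)] for _ in range(channels)]
branches = [
    [[random.gauss(0, 1) for _ in range(k)] for _ in range(channels)]
    for k in (1, 3, 5, 7)
]

multi = multi_branch_forward(x, branches)
single = depthwise_conv1d(x, merge_branches(branches))

# Largest absolute difference between the multi-branch and single-path outputs.
max_diff = max(
    abs(a - b) for ma, sa in zip(multi, single) for a, b in zip(ma, sa)
)
```

In this sketch the two outputs agree up to floating-point rounding, so the single merged kernel can stand in for all four branches at inference. It also hints at why extra temporal branches are cheap under the depthwise assumption: a branch with kernel size K adds C·K weights for C channels, whereas a full convolution with C input and C output channels would add C·C·K.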

A Multi-dimensional Deep Structured State Space Approach to Speech Enhancement Using Small-footprint Models

Spiking Structured State Space Model for Monaural Speech Enhancement

An End-to-End Speech Enhancement Framework Using Stacked Multi-scale Blocks

Dual-Branch Modeling Based on State-Space Model for Speech Enhancement

Dense-TSNet: Dense Connected Two-Stage Structure for Ultra-Lightweight Speech Enhancement

SICRN: Advancing Speech Enhancement through State Space Model and Inplace Convolution Techniques

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Modulating State Space Model with SlowFast Framework for Compute-Efficient Ultra Low-Latency Speech Enhancement

A Neural State-Space Modeling Approach to Efficient Speech Separation

Augmenting conformers with structured state-space sequence models for online speech recognition

A Neural State-Space Model Approach to Efficient Speech Separation

Selective State Space Model for Monaural Speech Enhancement

Multi-stage Progressive Learning-Based Speech Enhancement Using Time–Frequency Attentive Squeezed Temporal Convolutional Networks

Monaural Speech Enhancement with Deep Residual-Dense Lattice Network and Attention Mechanism in the Time Domain

Lightweight Full-band and Sub-band Fusion Network for Real Time Speech Enhancement

Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement

Decoupled Spatial and Temporal Processing for Resource Efficient Multichannel Speech Enhancement

Efficient Multi-Channel Speech Enhancement with Spherical Harmonics Injection for Directional Encoding

TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding

Multi-target Ensemble Learning Based Speech Enhancement with Temporal-Spectral Structured Target

A Multiobjective Learning and Ensembling Approach to High-Performance Speech Enhancement with Compact Neural Network Architectures