Abstract: Although many efforts have been made to reduce model complexity for speaker verification, it remains challenging to deploy speaker verification systems with satisfactory results on low-resource terminals. We design a transformation module that performs feature partition and fusion to implement lightweight speaker verification. The transformation module consists of multiple simple but effective operations, such as convolution, pooling, mean, concatenation, normalization, and element-wise summation. It works in a plug-and-play way and can be easily implanted into a wide variety of models to reduce model complexity while keeping the model error nearly unchanged. First, the input feature is split into several low-dimensional feature subsets to decrease the model complexity. Then, each feature subset is updated by fusing it with the correlational information among the feature subsets to enhance its representational capability. Finally, the updated feature subsets are independently fed into a block (one or several layers) of the model for further processing. The features output by the current block of the model are processed according to the steps above before they are fed into the next block. Experimental data are selected from two public speech corpora, VoxCeleb1 and VoxCeleb2. Results show that implanting the transformation module into three speaker verification models (AMCRN, ResNet34, and ECAPA-TDNN) slightly increases the model error and significantly decreases the model complexity. Overall, the proposed method outperforms baseline methods in memory requirement and computational complexity, with a lower equal error rate. It also generalizes well across truncated segments of various lengths.
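The three steps described in the abstract (partition, fusion, independent processing by a block) can be illustrated with a minimal sketch. It is not the authors' module. It assumes a feature map stored as a channels-by-frames list of lists, uses a plain mean across subsets as a stand-in for the inter-subset correlational information, and fuses by element-wise summation only; the convolution, pooling, and normalization operations mentioned in the abstract are omitted. The function names `partition`, `fuse`, and `transform_then_block`, and the final channel-wise concatenation of the block outputs, are hypothetical choices made for illustration.

```python
from statistics import fmean


def partition(feature, num_subsets):
    """Split a (channels x frames) feature map into equal channel groups."""
    channels = len(feature)
    if channels % num_subsets != 0:
        raise ValueError("channel count must be divisible by num_subsets")
    size = channels // num_subsets
    return [feature[i:i + size] for i in range(0, channels, size)]


def fuse(subsets):
    """Update each subset by adding a summary shared across all subsets.

    Assumption: the summary is the mean over subsets at each
    (channel, frame) position; the paper's fusion is more elaborate.
    """
    size = len(subsets[0])
    frames = len(subsets[0][0])
    summary = [
        [fmean(s[c][t] for s in subsets) for t in range(frames)]
        for c in range(size)
    ]
    # Element-wise summation of every subset with the shared summary.
    return [
        [[s[c][t] + summary[c][t] for t in range(frames)] for c in range(size)]
        for s in subsets
    ]


def transform_then_block(feature, num_subsets, block):
    """Partition, fuse, then apply the same block to each subset independently."""
    subsets = fuse(partition(feature, num_subsets))
    outputs = [block(s) for s in subsets]
    # Assumption: outputs are concatenated along channels so the result can be
    # partitioned and fused again before the next block.
    return [row for out in outputs for row in out]


if __name__ == "__main__":
    feature = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    identity_block = lambda subset: subset  # hypothetical placeholder block
    print(transform_then_block(feature, 2, identity_block))
```

In this sketch the block only ever sees a subset with fewer channels than the full feature map, which is where the reduction in model complexity would come from. The fusion step is what lets each subset retain information from the others.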

Papez: Resource-Efficient Speech Separation with Auditory Working Memory

Resource-Efficient Separation Transformer

TransMask: A Compact and Fast Speech Separation Model Based on Transformer

Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation

Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition

TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation

SETransformer: Speech Enhancement Transformer

Monaural Multi-Speaker Speech Separation Using Efficient Transformer Model

Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages

Efficient, Cluster-Informed, Deep Speech Separation with Cross-Cluster Information in AD-HOC Wireless Acoustic Sensor Networks

SepMamba: State-space models for speaker separation using Mamba

A 1.6-mW Sparse Deep Learning Accelerator for Speech Separation

Decoupling Pronunciation and Language for End-to-End Code-Switching Automatic Speech Recognition

Extremely Low Footprint End-to-End ASR System for Smart Device

Exploring Self-Attention Mechanisms for Speech Separation

SPGM: Prioritizing Local Features for enhanced speech separation performance

SPMamba: State-space model is all you need in speech separation

Lightweight Speaker Verification Using Transformation Module with Feature Partition and Fusion

Efficient time-domain speech separation using short encoded sequence network

Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning

Scaling strategies for on-device low-complexity source separation with Conv-Tasnet