Abstract: Convolutional neural networks (CNNs), such as the time-delay neural network (TDNN), have shown remarkable capability in learning speaker embeddings. However, they also incur a large computational cost in storage size, processing, and memory. Discovering a specialized CNN that meets a specific constraint requires substantial effort from human experts. Compared with hand-designed approaches, neural architecture search (NAS) is a practical technique for automating the manual architecture design process and has attracted increasing interest in spoken language processing tasks such as speaker recognition. In this paper, we propose EfficientTDNN, an efficient architecture search framework consisting of a TDNN-based supernet and a TDNN-NAS algorithm. The proposed supernet introduces to TDNN temporal convolution with different receptive-field ranges and feature aggregation at various resolutions from different layers. On top of it, the TDNN-NAS algorithm quickly searches for the desired TDNN architecture via weight-sharing subnets, which surprisingly reduces computation while handling the vast number of devices with various resource requirements. Experimental results on the VoxCeleb dataset show that the proposed EfficientTDNN enables approximately $10^{13}$ architectures varying in depth, kernel, and width. Under different computation constraints, it achieves a 2.20% equal error rate (EER) with 204 M multiply-accumulate operations (MACs), 1.41% EER with 571 M MACs, and 0.94% EER with 1.45 G MACs. Comprehensive investigations suggest that the trained supernet generalizes to subnets not sampled during training and obtains a favorable trade-off between accuracy and efficiency.
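The idea of weight-sharing subnets can be illustrated with a minimal sketch. It is not the paper's implementation: the search-space values, the function names `sample_subnet` and `slice_kernel`, and the slicing rule (leading channels, centered kernel taps) are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import random

# Hypothetical search space; the actual EfficientTDNN choices are not
# stated in the abstract, so these values are placeholders.
SEARCH_SPACE = {
    "depth": [2, 3, 4],
    "kernel": [1, 3, 5],
    "width": [128, 256, 512],
}


def sample_subnet(rng):
    """Pick one value per dimension to define a subnet configuration."""
    return {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}


def slice_kernel(weight, out_ch, in_ch, kernel):
    """Extract a smaller convolution weight from the shared supernet weight.

    `weight` is a nested list of shape [max_out][max_in][max_kernel].
    Assumed rule: keep the first `out_ch` output channels, the first
    `in_ch` input channels, and the centered `kernel` taps.
    """
    max_kernel = len(weight[0][0])
    start = (max_kernel - kernel) // 2
    return [
        [row[start:start + kernel] for row in filt[:in_ch]]
        for filt in weight[:out_ch]
    ]


# Toy shared weight of shape 4 x 4 x 5; each entry encodes its own index.
shared = [[[o * 100 + i * 10 + k for k in range(5)] for i in range(4)]
          for o in range(4)]
sub = slice_kernel(shared, out_ch=2, in_ch=3, kernel=3)
config = sample_subnet(random.Random(0))
```

Because every subnet reads from the same stored weights, under this assumed scheme evaluating a new configuration requires only slicing, not retraining from scratch.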

Compressed Time Delay Neural Network for Small-Footprint Keyword Spotting

A Time Delay Neural Network with Shared Weight Self-Attention for Small-Footprint Keyword Spotting

Compact Feedforward Sequential Memory Networks For Small-Footprint Keyword Spotting

An Empirical Study of Cross-Lingual Transfer Learning Techniques for Small-Footprint Keyword Spotting

Model compression applied to small-footprint keyword spotting

On the Application and Compression of Deep Time Delay Neural Network for Embedded Statistical Parametric Speech Synthesis

Small-Footprint Keyword Spotting with Multi-Scale Temporal Convolution

Multi-Task Learning and Weighted Cross-Entropy for DNN-Based Keyword Spotting

Time-Delayed Bottleneck Highway Networks Using a DFT Feature for Keyword Spotting

Small-footprint Keyword Spotting with Graph Convolutional Network

Depthwise Separable Convolutional ResNet with Squeeze-and-Excitation Blocks for Small-footprint Keyword Spotting

Low-Bit Quantization and Quantization-Aware Training for Small-Footprint Keyword Spotting

Low-complex and Highly-performed Binary Residual Neural Network for Small-footprint Keyword Spotting

Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting

Max-pooling Loss Training of Long Short-Term Memory Networks for Small-Footprint Keyword Spotting

A real-time small target detection network

Shrinking Your TimeStep: Towards Low-Latency Neuromorphic Object Recognition with Spiking Neural Network

TB-DNN: A Thin Binarized Deep Neural Network with High Accuracy

Delay learning based on temporal coding in Spiking Neural Networks

EfficientTDNN: Efficient Architecture Search for Speaker Recognition