Abstract: Recent speaker verification (SV) systems have shown a trend toward adopting deeper speaker embedding extractors. Although deeper and larger neural networks can significantly improve performance, their substantial memory requirements hinder training on consumer GPUs. In this paper, we explore a memory-efficient training strategy for deep speaker embedding learning in resource-constrained scenarios. First, we conduct a systematic analysis of GPU memory allocation during SV system training. Empirical observations show that activations and optimizer states are the main sources of memory consumption. For activations, we design two types of reversible neural networks that eliminate the need to store intermediate activations during back-propagation, thereby significantly reducing memory usage without performance loss. For optimizer states, we introduce a dynamic quantization approach that replaces the original 32-bit floating-point values with a dynamic tree-based 8-bit data type. Experimental results on VoxCeleb demonstrate that the reversible variants of ResNets and DF-ResNets can be trained without caching activations in GPU memory. In addition, the 8-bit versions of SGD and Adam save 75% of memory costs while maintaining performance compared to their 32-bit counterparts. Finally, a detailed comparison of memory usage and performance indicates that our proposed models achieve up to 16.2x memory savings, with nearly identical parameters and performance compared to the vanilla systems. In contrast to the previous need for multiple high-end GPUs such as the A100, we can effectively train deep speaker embedding extractors with just one or two consumer-level 2080Ti GPUs.
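
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the paper's architecture: it assumes the additive coupling commonly used in reversible residual networks, and the functions `f`, `g`, `rev_forward`, `rev_inverse`, and `optimizer_state_bytes` are hypothetical names chosen for illustration. The first part shows why a reversible block does not need to cache its inputs, since they can be recomputed from its outputs. The second part reproduces the 75% figure as simple arithmetic on state size, ignoring any per-block quantization metadata.

```python
import math

def f(x):
    # Hypothetical residual function F (stands in for a conv sub-block).
    return [math.tanh(v) for v in x]

def g(x):
    # Hypothetical residual function G.
    return [0.5 * v * v for v in x]

def rev_forward(x1, x2):
    # Additive coupling: each half is updated using only the other half.
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def rev_inverse(y1, y2):
    # Recover the inputs from the outputs, so they need not be stored
    # for back-propagation.
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

def optimizer_state_bytes(num_params, states_per_param, bits):
    # Raw size of the optimizer states (e.g. Adam keeps two per parameter).
    return num_params * states_per_param * bits // 8

x1, x2 = [0.1, -0.2, 0.3], [1.0, 2.0, -3.0]
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)

bytes_32 = optimizer_state_bytes(1_000_000, 2, 32)
bytes_8 = optimizer_state_bytes(1_000_000, 2, 8)
saving = 1 - bytes_8 / bytes_32
```

Here `r1` and `r2` match `x1` and `x2` up to floating-point rounding, and `saving` is the fraction of optimizer-state memory removed by moving from 32-bit to 8-bit values.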

GPU Accelerated GMM Supervectors for Speaker and Language Recognition

Design of a GMM Vector Multiplier Based on Two-dimensional Systolic Array

Memory-Efficient Training for Deep Speaker Embedding Learning in Speaker Verification

GPU-FV: Realtime Fisher Vector and Its Applications in Video Monitoring

GPU-Accelerated Viterbi Exact Lattice Decoder for Batched Online and Offline Speech Recognition

Accelerating the Training of HTK on GPU with CUDA

Fast MPEG-CDVS Encoder with GPU-CPU Hybrid Computing

High Throughput MIMO-OFDM Detection with Graphics Processing Units

Exploiting Glottal Information in Speaker Recognition Using Parallel GMMs

A High Performance FPGA-Based Accelerator Design for End-to-End Speaker Recognition System

GPU Accelerated Computation for Surface Topography Measurement

GPU-accelerated Guided Source Separation for Meeting Transcription

Three-level GPU Accelerated Gaussian Mixture Model for Background Subtraction

Accelerating Support Vector Machine Learning with GPU-Based MapReduce

GPU Based Fast MPEG-CDVS Encoder

GPU-based Fast Processing for a Distributed Acoustic Sensor Using an LFM Pulse

Memory-Constrained Vectorization and Scheduling of Dataflow Graphs for Hybrid CPU-GPU Platforms

GPU accelerated face detection

Exponential Moving Average Model in Parallel Speech Recognition Training

A Practical Implementation of GPU based Accelerator for Deep Neural Networks

Accelerating Video Decoding Using GPU