LightCAM: A Fast and Light Implementation of Context-Aware Masking based D-TDNN for Speaker Verification

Di Cao, Xianchen Wang, Junfeng Zhou, Jiakai Zhang, Yanjing Lei, Wenpeng Chen

2024-02-12

Abstract: Traditional Time Delay Neural Networks (TDNN) have achieved state-of-the-art performance at the cost of high computational complexity and slower inference speed, making them difficult to implement in an industrial environment. The Densely Connected Time Delay Neural Network (D-TDNN) with a Context-Aware Masking (CAM) module has proven to be an efficient structure that reduces complexity while maintaining system performance. In this paper, we propose a fast and lightweight model, LightCAM, which further adopts a depthwise separable convolution module (DSM) and uses multi-scale feature aggregation (MFA) for feature fusion at different levels. Extensive experiments are conducted on the VoxCeleb dataset. The comparative results show that LightCAM achieves an EER of 0.83 and a MinDCF of 0.0891 on VoxCeleb1-O, outperforming other mainstream speaker verification methods. In addition, complexity analysis further demonstrates that the proposed architecture has lower computational cost and faster inference speed.

Computation and Language, Sound, Audio and Speech Processing
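The EER quoted in the abstract is the operating point at which the false-acceptance rate equals the false-rejection rate. The following is a minimal sketch of that idea in plain Python. The function name `equal_error_rate` and the toy scores are illustrative assumptions, not the paper's evaluation code, and real toolkits usually interpolate between thresholds rather than picking the closest one.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by scanning every observed score as a threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # Keep the threshold where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy similarity scores (hypothetical): one target trial scores below
# one non-target trial, so the two error rates meet at 25%.
target = [0.9, 0.8, 0.7, 0.4]
nontarget = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(target, nontarget)
print(f"EER = {eer:.2%}")
```

A lower EER means target and non-target score distributions overlap less. MinDCF complements it by weighting the two error types with application-specific costs and a target prior.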

What problem does this paper attempt to address?

The paper addresses the high computational complexity and slow inference speed of traditional Time Delay Neural Networks (TDNN) in speaker verification. Combining the Densely Connected Time Delay Neural Network (D-TDNN) with a Context-Aware Masking (CAM) module has already improved efficiency, but there is still room for optimization. The paper therefore proposes a fast and lightweight model called LightCAM. LightCAM adopts a depthwise separable convolution module (DSM) to reduce computational complexity, and uses multi-scale feature aggregation (MFA) to fuse features from different levels and improve the discriminative ability of the model.

Experiments on the VoxCeleb dataset show that LightCAM achieves a lower Equal Error Rate (EER) and Minimum Detection Cost Function (MinDCF), with lower computational cost and faster inference, than mainstream methods such as ECAPA-TDNN, ResNet34, and CAM++. The complexity analysis reports reductions in the number of parameters, floating-point operations (FLOPs), and real-time factor (RTF). In terms of RTF, LightCAM has the fastest inference speed among the compared methods. Ablation studies further demonstrate the contribution of DSM and MFA to model performance.

Overall, the objective of the paper is to design a fast, lightweight, and efficient speaker verification model. LightCAM achieves this by introducing DSM and MFA, striking a better balance between performance and complexity for practical use in industrial environments.
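To make the complexity argument concrete, the sketch below compares the weight count of a standard 1D convolution with that of a depthwise separable one (a per-channel depthwise filter followed by a 1x1 pointwise convolution). The channel and kernel sizes are illustrative assumptions, not values from the paper, and biases are ignored.

```python
def standard_conv1d_params(c_in, c_out, kernel):
    """Weights in a standard 1D convolution: every output channel sees every input channel."""
    return c_in * c_out * kernel


def depthwise_separable_conv1d_params(c_in, c_out, kernel):
    """Weights in a depthwise convolution followed by a pointwise (1x1) convolution."""
    depthwise = c_in * kernel   # one temporal filter per input channel
    pointwise = c_in * c_out    # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes chosen only for illustration.
c_in, c_out, kernel = 128, 128, 3
std = standard_conv1d_params(c_in, c_out, kernel)
dsc = depthwise_separable_conv1d_params(c_in, c_out, kernel)
ratio = dsc / std  # equals 1/c_out + 1/kernel

print(f"standard: {std}, depthwise separable: {dsc}, ratio: {ratio:.3f}")
```

Under these assumptions the separable layer needs roughly a third of the weights. The saving grows with kernel size, since the ratio is `1/c_out + 1/kernel`.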

CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking

CAM: Context-Aware Masking for Robust Speaker Verification

TMS: Temporal multi-scale in time-delay neural network for speaker verification

DS-TDNN: Dual-stream Time-delay Neural Network with Global-aware Filter for Speaker Verification

Densely Connected Time Delay Neural Network for Speaker Verification

PCF: ECAPA-TDNN with Progressive Channel Fusion for Speaker Verification

NeXt-TDNN: Modernizing Multi-Scale Temporal Convolution Backbone for Speaker Verification

MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances

Self-attention Based Speaker Recognition Using Cluster-Range Loss

P-vectors: A Parallel-Coupled TDNN/Transformer Network for Speaker Verification

Deep Speaker Feature Learning for Text-independent Speaker Verification

Attention and DCT based Global Context Modeling for Text-independent Speaker Recognition

Weighted Cluster-Range Loss and Criticality-Enhancement Loss for Speaker Recognition

Lightweight Speaker Verification Using Transformation Module with Feature Partition and Fusion

EfficientTDNN: Efficient Architecture Search for Speaker Recognition

Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations

End-to-End Feature Learning for Text-Independent Speaker Verification

Neighborhood Attention Transformer with Progressive Channel Fusion for Speaker Verification

A focus module-based lightweight end-to-end CNN framework for voiceprint recognition