LightCAM: A Fast and Light Implementation of Context-Aware Masking based D-TDNN for Speaker Verification

Di Cao, Xianchen Wang, Junfeng Zhou, Jiakai Zhang, Yanjing Lei, Wenpeng Chen
2024-02-12
Abstract: Traditional Time Delay Neural Networks (TDNN) have achieved state-of-the-art performance at the cost of high computational complexity and slower inference speed, making them difficult to deploy in an industrial environment. The Densely Connected Time Delay Neural Network (D-TDNN) with a Context-Aware Masking (CAM) module has proven to be an efficient structure that reduces complexity while maintaining system performance. In this paper, we propose a fast and lightweight model, LightCAM, which further adopts a depthwise separable convolution module (DSM) and uses multi-scale feature aggregation (MFA) for feature fusion at different levels. Extensive experiments are conducted on the VoxCeleb dataset. The comparative results show that LightCAM achieves an EER of 0.83 and a MinDCF of 0.0891 on VoxCeleb1-O, outperforming other mainstream speaker verification methods. In addition, complexity analysis further demonstrates that the proposed architecture has lower computational cost and faster inference speed.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the high computational complexity and slow inference speed of traditional Time Delay Neural Networks (TDNN) in speaker verification. Although combining the Densely Connected Time Delay Neural Network (D-TDNN) with a Context-Aware Masking (CAM) module has improved efficiency, there is still room for optimization. The paper therefore proposes a fast and lightweight model called LightCAM. LightCAM adopts a depthwise separable convolution module (DSM) to reduce computational complexity and uses multi-scale feature aggregation (MFA) to fuse features from different levels, which improves the model's discriminative ability.

Experiments on the VoxCeleb dataset show that LightCAM achieves an Equal Error Rate (EER) of 0.83 and a Minimum Detection Cost Function (MinDCF) of 0.0891 on VoxCeleb1-O, with lower computational cost and faster inference than mainstream methods such as ECAPA-TDNN, ResNet34, and CAM++. The complexity analysis reports significant reductions in the number of parameters, floating-point operations (FLOPs), and real-time factor (RTF); in terms of RTF, LightCAM has the fastest inference speed among the compared methods. Ablation studies further demonstrate that DSM and MFA each contribute to model performance.

Overall, the paper aims to design a fast, lightweight, and efficient speaker verification model. LightCAM achieves this by introducing DSM and MFA, striking a better balance between performance and complexity for practical applications in industrial environments.
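To illustrate why a depthwise separable convolution lowers complexity, the sketch below compares weight counts for a standard 1D convolution and its depthwise separable counterpart. It is only an arithmetic illustration: the function names and the layer sizes (512 channels, kernel size 3) are assumptions chosen for the example, not values taken from the paper, and the actual DSM in LightCAM may include additional components.

```python
def standard_conv1d_params(c_in, c_out, kernel_size):
    """Weights in a standard 1D convolution (bias omitted):
    one kernel_size-tap filter per (input, output) channel pair."""
    return c_in * c_out * kernel_size


def depthwise_separable_conv1d_params(c_in, c_out, kernel_size):
    """Weights in a depthwise separable 1D convolution (bias omitted):
    a depthwise step with one filter per input channel, followed by
    a pointwise (kernel size 1) convolution that mixes channels."""
    depthwise = c_in * kernel_size
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer sizes, NOT taken from the paper.
c_in, c_out, k = 512, 512, 3

std = standard_conv1d_params(c_in, c_out, k)
sep = depthwise_separable_conv1d_params(c_in, c_out, k)
ratio = sep / std  # equals 1/c_out + 1/k

print(f"standard: {std}, separable: {sep}, ratio: {ratio:.3f}")
```

For these assumed sizes the separable layer needs roughly one third of the weights, since the ratio reduces to 1/c_out + 1/k. The multiply-accumulate count per output frame scales the same way, which is the general mechanism behind the reduction in parameters and FLOPs described above.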