Abstract: With the development of deep learning, the application of artificial intelligence in education has attracted increasing attention. However, most existing methods for analyzing verbal interaction in the classroom are still semi-manual: they lack intelligent automation and are not suited to routine use. We therefore propose a nested residual network with multi-scale aggregation and a speaker attention mechanism, which distinguishes the speech of teachers and students by identifying audio clips recorded in the classroom, so that the teaching mode can be analyzed through the verbal interaction between teachers and students. Existing speaker verification methods cannot be adapted directly to the classroom scene for two reasons: the language environment is inconsistent, and the speaker distribution differs. To address this, we propose a deep multi-scale aggregation residual network model that preserves the validity of voiceprint information to the greatest extent. A speaker attention mechanism that incorporates channel-domain and frequency-domain information is introduced to capture the differences in pronunciation habits and voiceprint amplitude between teachers and students. Experimental results demonstrate that the proposed method has strong learning capacity and outperforms state-of-the-art methods. On the public English dataset LibriSpeech, it obtains a 6.20% accuracy improvement and a 4.00% equal error rate improvement over the compared methods. To adapt to Chinese classrooms, we also show, through training on the Chinese dataset AISHELL, that the proposed method has good cross-language adaptability. Experimental results in Chinese classrooms show that the proposed method achieves an improvement of up to 22.70% over the other methods. Our project will be publicly available at http://ecourse.nercel.com.
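
The abstract reports results in terms of equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how EER can be computed from verification trial scores. It is not taken from the paper: the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy trial data are all assumptions made for illustration.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Estimate the EER and its threshold (illustrative helper, not from the paper).

    scores: similarity scores, higher means "more likely the same speaker".
    labels: 1 for target (same-speaker) trials, 0 for impostor trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    n_target = int(np.sum(labels == 1))
    n_impostor = int(np.sum(labels == 0))
    if n_target == 0 or n_impostor == 0:
        raise ValueError("need at least one target and one impostor trial")

    best_gap = float("inf")
    best_eer = None
    best_threshold = None

    # Sweep every observed score as a candidate decision threshold.
    for threshold in np.unique(scores):
        accepted = scores >= threshold
        # False acceptance rate: impostor trials that were accepted.
        far = np.sum(accepted & (labels == 0)) / n_impostor
        # False rejection rate: target trials that were rejected.
        frr = np.sum(~accepted & (labels == 1)) / n_target
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2.0
            best_threshold = threshold

    return float(best_eer), float(best_threshold)


# Hypothetical toy trials: four target and four impostor scores.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
eer, threshold = equal_error_rate(toy_scores, toy_labels)
print(f"EER = {eer:.2%} at threshold {threshold:.2f}")
```

With a finite number of trials the two error rates rarely cross exactly, so this sketch takes the threshold where they are closest and averages them. Evaluation toolkits often interpolate along the ROC curve instead, which can give slightly different values.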

Self-Attention Networks for Text-Independent Speaker Verification

Self-attention Based Speaker Recognition Using Cluster-Range Loss

End-to-End Feature Learning for Text-Independent Speaker Verification

Bidirectional Attention For Text-Dependent Speaker Verification

End-to-End Attention based Text-Dependent Speaker Verification

Text-Independent Speaker Verification Using Long Short-Term Memory Networks

CNN with Phonetic Attention for Text-Independent Speaker Verification

MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances

Densely Connected Time Delay Neural Network for Speaker Verification

Contrastive Learning for improving End-to-end Speaker Verification

Weighted Cluster-Range Loss and Criticality-Enhancement Loss for Speaker Recognition

Improving Deep CNN Networks with Long Temporal Context for Text-Independent Speaker Verification

Towards Robust Speaker Verification with Target Speaker Enhancement

Adder Neural Networks for Speaker Verification

Local Information Modeling with Self-Attention for Speaker Verification

RSKNet-MTSP: Effective and Portable Deep Architecture for Speaker Verification

Self-Convolution for Automatic Speech Recognition

NResNet: nested residual network based on channel and frequency domain attention mechanism for speaker verification in classroom

Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision

Self-Supervised Learning with Cluster-Aware-DINO for High-Performance Robust Speaker Verification