Abstract: With the development of deep learning, the application of artificial intelligence in education has attracted growing attention. However, most existing methods for analyzing verbal interaction in the classroom are still semi-manual, and are neither intelligent nor suited to routine use. We therefore propose a nested residual network with multi-scale aggregation and a speaker attention mechanism, which distinguishes the speech of teachers and students by identifying audio clips recorded in the classroom, so that the teaching mode can be analyzed through the verbal interaction between teachers and students. Existing speaker verification methods cannot be adapted directly to the classroom scene for two reasons: the language environment is inconsistent, and the speaker distribution is different. To address this, we propose a deep multi-scale aggregation residual network model that preserves the validity of voiceprint information as far as possible. A speaker attention mechanism that combines channel-domain and frequency-domain information is introduced to capture the differences in pronunciation habits and voiceprint amplitude between teachers and students. Experimental results show that the proposed method has strong learning capacity and outperforms state-of-the-art methods. On the public English dataset LibriSpeech, it obtains a 6.20% accuracy improvement and a 4.00% equal error rate improvement over the compared methods. To adapt the method to Chinese classrooms, we also show, through training on the Chinese dataset AISHELL, that it has good cross-language adaptability. Experimental results in Chinese classrooms show that the proposed method achieves an improvement of up to 22.70% over the other methods. Our project will be publicly available at http://ecourse.nercel.com.

A Speaker Recognition Method Based on Stable Learning

Speaker recognition based on improved ECAPA-TDNN network

Speaker recognition with two-step multi-modal deep cleansing

Weighted Cluster-Range Loss and Criticality-Enhancement Loss for Speaker Recognition

Self-attention Based Speaker Recognition Using Cluster-Range Loss

Speaker recognition based on deep learning: An overview

Research on Voiceprint Recognition Technology Based on Deep Neural Network

Improved deep speaker feature learning for text-dependent speaker recognition

Speaker recognition using Improved Butterfly Optimization Algorithm with hybrid Long Short Term Memory network

Self-Supervised Learning with Cluster-Aware-DINO for High-Performance Robust Speaker Verification

Standardized Evaluation Method of Pronunciation Teaching Based on Deep Learning

Speaker Recognition Based on Pre-Trained Model and Deep Clustering

Contrastive Learning for improving End-to-end Speaker Verification

Deep Speaker Embedding Learning with Multi-level Pooling for Text-independent Speaker Verification

Large-Scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification

DLD: An Optimized Chinese Speech Recognition Model Based on Deep Learning

High-Level CNN and Machine Learning Methods for Speaker Recognition

NResNet: nested residual network based on channel and frequency domain attention mechanism for speaker verification in classroom

Look, Listen and Learn - A Multimodal LSTM for Speaker Identification