Abstract: In human-computer interaction, today's more advanced speech recognition systems handle only a single language, and there is a pressing need to apply new deep learning techniques to improve them. Against this background, this research uses deep neural networks (DNN) to investigate mixed speech recognition for Chinese and English. A DNN-based single-language speech recognition algorithm is investigated first, and a new hybrid Chinese-English speech recognition model is then constructed by fusing the attention mechanism with the CTC loss function. The hybrid model is an end-to-end model built on the Transformer framework and combined with the monotonic alignment property of the CTC loss function, so that complex sound units can be mapped to characters for easier extraction and recognition. The constructed models were tested on a Chinese speech dataset, an English speech dataset, and a mixed Chinese-English speech dataset to determine their recognition accuracy and speed. The results show that the proposed model achieves 81.2% recognition accuracy and a recognition speed of 100 per minute on the mixed Chinese-English dataset, outperforming the three other models it was compared against on both measures. The study thus addresses the need for improved speech recognition systems with a novel hybrid model for mixed Chinese-English speech, and the model shows promise for enhancing human-computer interaction and supporting efficient communication between Chinese and English speakers.
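The two ideas the abstract relies on, the monotonic alignment of CTC and the fusion of a CTC loss with an attention loss, can be illustrated with a minimal sketch. This is not the authors' implementation: the blank symbol, the function names `ctc_collapse` and `joint_loss`, and the interpolation weight `ctc_weight` are all assumptions made for illustration, and the weighted sum is simply the common way hybrid CTC/attention systems combine the two objectives.

```python
from itertools import groupby

BLANK = "<blank>"  # assumed name for the CTC blank symbol


def ctc_collapse(frame_labels, blank=BLANK):
    """Collapse a frame-level label path into an output character sequence.

    CTC rule: merge consecutive repeated labels first, then drop blanks.
    The alignment is monotonic, so output order follows frame order.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


def joint_loss(ctc_loss, attention_loss, ctc_weight=0.3):
    """Interpolate the two losses; ctc_weight is an assumed hyperparameter."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


# A hypothetical per-frame path over a mixed Chinese-English character set.
path = ["我", "我", BLANK, "爱", BLANK, "A", "A", BLANK, "I", "I"]
decoded = ctc_collapse(path)
```

Because Chinese characters and English letters share one output vocabulary in this sketch, a single frame-level path can switch languages mid-utterance without any change to the collapse rule.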

ASKCC-DCNN-CTC: A Multi-Core Two Dimensional Causal Convolution Fusion Network with Attention Mechanism for End-to-End Speech Recognition

Residual Convolutional CTC Networks for Automatic Speech Recognition

A Convenient and Extensible Offline Chinese Speech Recognition System Based on Convolutional CTC Networks

CACnet: Cube Attentional CNN for Automatic Speech Recognition

A hybrid CTC+Attention model based on end-to-end framework for multilingual speech recognition

Multi-Scale TCN: Exploring Better Temporal DNN Model for Causal Speech Enhancement

Pyramid Multi-branch Fusion DCNN with Multi-Head Self-Attention for Mandarin Speech Recognition

A Chinese Acoustic Model Based On Convolutional Neural Network

End-to-End Speech Recognition Model Based on Dilated Sparse Aware Network

Attention-Based Gated Scaling Adaptive Acoustic Model for CTC-Based Speech Recognition

2D-to-2D Mask Estimation for Speech Enhancement Based on Fully Convolutional Neural Network

Cascaded CNN-resBiLSTM-CTC: an End-to-End Acoustic Model for Speech Recognition

CR-CTC: Consistency regularization on CTC for improved speech recognition

3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition

Deep Neural Network-based Mixed Speech Recognition Technology for Chinese and English

Multi-LCNN: A Hybrid Neural Network Based on Integrated Time-Frequency Characteristics for Acoustic Scene Classification

TFCN: Temporal-Frequential Convolutional Network for Single-Channel Speech Enhancement

TC-SKNet with GridMask for Low-complexity Classification of Acoustic Scene

Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition

kNN-CTC: Enhancing ASR via Retrieval of CTC Pseudo Labels

Coarse-Grained Attention Fusion with Joint Training Framework for Complex Speech Enhancement and End-to-End Speech Recognition