Abstract: In the field of human-computer interaction, even the more advanced speech recognition systems currently in use recognize only a single language, and new deep learning techniques are urgently needed to improve them. In this context, this study uses deep neural networks (DNNs) to investigate mixed speech recognition for Chinese and English. A DNN-based single-language speech recognition algorithm is investigated first, and a new hybrid Chinese-English speech recognition model is then constructed by fusing the attention mechanism with the CTC loss function. The hybrid model uses an end-to-end design and the Transformer framework together with the monotonic alignment property of the CTC loss function, which allows complex sound units to be transformed into characters for easier extraction and recognition. The constructed models were tested on a Chinese speech dataset, an English speech dataset, and a mixed Chinese-English speech dataset to determine their recognition accuracy and speed. The results show that the proposed model achieves 81.2% recognition accuracy and a recognition speed of 100 per minute on the mixed Chinese-English dataset, considerably better than the three other models it was compared against. The study thus addresses the need for improved speech recognition systems by introducing a hybrid model for mixed Chinese-English speech recognition, and the experimental results support its advantage in both accuracy and speed. The model has potential for enhancing human-computer interaction and enabling efficient communication between Chinese and English speakers.
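
The abstract does not give the exact form of the fused objective, so the following is only a minimal sketch of one common way to combine a CTC loss with an attention-decoder loss: a weighted sum. It also shows how the CTC term sums over monotonic alignments. The function names, the `ctc_weight` value, and the toy inputs are assumptions made for illustration, not the authors' implementation.

```python
import math

def ctc_neg_log_likelihood(probs, target, blank=0):
    """CTC loss for one utterance via the forward algorithm.

    probs[t][k] is the per-frame probability of symbol k at frame t.
    target is a list of label ids without blanks.
    """
    # Extended target: blank, y1, blank, y2, ..., blank
    ext = [blank]
    for label in target:
        ext += [label, blank]
    num_states, num_frames = len(ext), len(probs)

    # alpha[s]: total probability of all monotonic paths ending in state s
    alpha = [0.0] * num_states
    alpha[0] = probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, num_frames):
        new = [0.0] * num_states
        for s in range(num_states):
            total = alpha[s]                # stay in the same state
            if s >= 1:
                total += alpha[s - 1]       # advance by one state
            # Skipping a blank is allowed only between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid paths end in the last label or the trailing blank
    likelihood = alpha[-1] + (alpha[-2] if num_states > 1 else 0.0)
    return -math.log(likelihood)

def joint_loss(ctc_loss, attention_loss, ctc_weight=0.3):
    """Weighted sum of the two losses; ctc_weight is a hypothetical setting."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss

# Toy example: 2 frames, symbols {0: blank, 1: "a"}, uniform frame posteriors.
toy_probs = [[0.5, 0.5], [0.5, 0.5]]
toy_ctc = ctc_neg_log_likelihood(toy_probs, target=[1])
toy_total = joint_loss(toy_ctc, attention_loss=1.0)
```

In the toy example, three frame-level paths (a-a, a-blank, blank-a) collapse to the single character "a", so the CTC likelihood is the sum of their probabilities. In practice the attention term would be the decoder's cross-entropy loss rather than a fixed number.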

Integrated Method of Deep Learning and Large Language Model in Speech Recognition

Deep LSTM for Large Vocabulary Continuous Speech Recognition

Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study

A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR

Research on the Application and Optimization Strategies of Deep Learning in Large Language Models

Using Large Language Model for End-to-End Chinese ASR and NER

Deep Neural Network-based Mixed Speech Recognition Technology for Chinese and English

Deep Neural Networks Language Model Based on CNN and LSTM Hybrid Architecture

Building DNN acoustic models for large vocabulary speech recognition

Deep joint learning for language recognition

On decoder-only architecture for speech-to-text and large language model integration

Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition

An Empirical Study of Language Model Integration for Transducer Based Speech Recognition

Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition

Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion

Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model

A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition

Enhancing CTC-based speech recognition with diverse modeling units

A Survey on Speech Large Language Models

Acoustic Model Fusion for End-to-end Speech Recognition

DLD: An Optimized Chinese Speech Recognition Model Based on Deep Learning