Abstract: The convolution and pooling operations in a convolution subsampling network discard useful information and extract features incompletely. As a result, current conformer-based speech processing architectures do not fully capture the shallow features of speech signals, which limits their accuracy and speed. To address these problems, we investigated using a capsule network to improve the accuracy of feature extraction in a conformer-based model and proposed a new end-to-end model architecture for speech recognition. First, to improve the accuracy of speech feature extraction, a capsule network with a dynamic routing mechanism was introduced into the conformer model; thus, the structural information in speech was preserved and passed to the conformer blocks via sequestered vectors, and the learning ability of the conformer-based model was significantly enhanced through dynamic weight updating. Second, a residual network was added to the capsule blocks, which improved the mapping ability of our model and reduced the training difficulty. Furthermore, a bi-transformer model was adopted in the decoding network to promote consistency between the hypotheses in different directions through bidirectional modeling. Finally, the effectiveness and robustness of the proposed model were verified against different types of recognition models in multiple sets of experiments. The experimental results demonstrated that our speech recognition model achieved a lower word error rate without a language model, owing to the more accurate speech feature extraction and learning provided by the capsule network. Because the architecture combines the advantages of the capsule network and the conformer encoder, it also has potential for other speech-related applications.
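The dynamic routing mechanism mentioned in the abstract can be illustrated with a minimal sketch of generic routing-by-agreement between two capsule layers, in the form introduced by Sabour et al. (2017). The abstract does not give the capsule configuration used in the proposed model, so everything here is an assumption made for illustration: the function names `squash`, `softmax`, and `dynamic_routing`, the input layout `u_hat[i][j]` (the prediction vector from input capsule `i` for output capsule `j`), and the default of three routing iterations. Plain Python lists are used so that the sketch has no dependencies.

```python
import math


def squash(s):
    """Shrink a vector so its norm lies in [0, 1) while keeping its direction."""
    sq = sum(x * x for x in s)
    if sq == 0.0:
        return [0.0] * len(s)
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in s]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement (illustrative sketch, not the paper's exact setup).

    u_hat[i][j] is the prediction vector from input capsule i for output
    capsule j. Returns (v, c): the output capsule vectors and the coupling
    coefficients that were used to compute them in the final iteration.
    """
    num_in, num_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    # Routing logits start at zero, so the first pass couples uniformly.
    b = [[0.0] * num_out for _ in range(num_in)]
    v, c = [], []
    for _ in range(num_iters):
        # Each input capsule distributes its output over the output capsules.
        c = [softmax(row) for row in b]
        v = []
        for j in range(num_out):
            s_j = [
                sum(c[i][j] * u_hat[i][j][d] for i in range(num_in))
                for d in range(dim)
            ]
            v.append(squash(s_j))
        # Raise the logit where a prediction agrees with the output (dot product).
        for i in range(num_in):
            for j in range(num_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c


if __name__ == "__main__":
    # Hypothetical toy input: 3 input capsules, 2 output capsules, 2-D vectors.
    toy = [[[1.0, 0.0], [0.0, 0.1]] for _ in range(3)]
    outputs, couplings = dynamic_routing(toy)
    print(outputs)
    print(couplings)
```

In this sketch the coupling coefficients are recomputed on every pass from the agreement between predictions and outputs, which is the sense in which the weights are updated dynamically rather than learned by backpropagation alone. The length of each output vector stays below 1 because of the squashing step.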

An Improvement to Conformer-Based Model for High-Accuracy Speech Feature Extraction and Learning
