Abstract: The relationships between muscle movements and neural signals make it possible to decode silent speech from neuromuscular activity. The decoding can be formulated as a supervised classification task. Electromyography (EMG) signals captured from surface articulatory muscles contain useful information that can assist in decoding speech. Spectrograms obtained from EMG carry a wealth of information relevant to decoding but have not yet been fully explored. In addition, decoding results are often uncertain, so it is important to quantify prediction confidence. This paper aims to improve decoding performance by representing time series signals as spectrograms and utilising Inductive Conformal Prediction (ICP) to provide predictions with confidence. All EMG data are recorded from six dedicated facial muscles while participants recite the displayed words subvocally. Three pre-trained convolutional models (MobileNet-V1, ResNet18 and Xception) are used to extract features from the spectrograms for classification. Both bidirectional Long Short-Term Memory (Bi-LSTM) and Gated Recurrent Unit (GRU) classifiers are used for prediction. Furthermore, an ICP decoder based on Bi-LSTM is built to provide guaranteed predictions for each example at a specified confidence level. The proposed method, which combines Xception-based feature extraction with Bi-LSTM classification, achieves an accuracy of 0.87, higher than the other methods. ICP outputs confidence measurements for each example that can help users evaluate the reliability of new predictions. Experimental results demonstrate the practical usefulness of the approach in decoding articulatory neuromuscular activity and the advantages of applying ICP.
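
To make the ICP step concrete, the following is a minimal sketch of how inductive conformal p-values and prediction sets can be computed from the class probabilities of any underlying classifier. It is not the paper's implementation: the nonconformity measure (one minus the probability assigned to a label), the function names `icp_p_values` and `prediction_sets`, and the toy numbers are all assumptions made for illustration. In the setting described above, the probabilities would presumably come from the Bi-LSTM classifier evaluated on a held-out calibration set.

```python
import numpy as np

def icp_p_values(cal_probs, cal_labels, test_probs):
    """Return an (n_test, n_classes) array of conformal p-values.

    Assumed nonconformity measure: 1 - probability assigned to the label.
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    test_probs = np.asarray(test_probs, dtype=float)

    # Nonconformity of each calibration example with respect to its true label.
    cal_scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    # Nonconformity of each test example under every candidate label.
    test_scores = 1.0 - test_probs

    # For each (test example, label): count calibration scores that are
    # at least as nonconforming, then apply the usual +1 correction.
    counts = (cal_scores[None, None, :] >= test_scores[:, :, None]).sum(axis=2)
    return (counts + 1) / (len(cal_scores) + 1)

def prediction_sets(p_values, significance):
    """Keep every label whose p-value exceeds the significance level."""
    return [np.flatnonzero(row > significance).tolist() for row in p_values]

# Toy calibration set with two classes (hypothetical numbers).
cal_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
cal_labels = [0, 1, 0, 1]
test_probs = [[0.85, 0.15]]

p_vals = icp_p_values(cal_probs, cal_labels, test_probs)
sets_75 = prediction_sets(p_vals, significance=0.25)  # 75% confidence
sets_90 = prediction_sets(p_vals, significance=0.10)  # 90% confidence
```

A higher confidence level (lower significance) can only enlarge the prediction set, which is the trade-off ICP exposes: under the exchangeability assumption, the true label is excluded from the set with probability at most the chosen significance level.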

Multimodal neural pronunciation modeling for spoken languages with logographic origin

A High Accuracy Approach for Word-Phoneme Translation Using Neural Networks

Speech neuromuscular decoding based on spectrogram images using conformal predictors with Bi-LSTM.

Hierarchical Character Embeddings: Learning Phonological and Semantic Representations in Languages of Logographic Origin using Recursive Neural Networks

Effective Phoneme Decoding With Hyperbolic Neural Networks for High-Performance Speech BCIs

Decoding Chinese phonemes from intracortical brain signals with hyperbolic-space neural representations

A Modularized Neural Network with Language-Specific Output Layers for Cross-lingual Voice Conversion

CMCI: A Robust Multimodal Fusion Method for Spiking Neural Networks

Pronunciation Assessment with Multi-modal Large Language Models

Acoustic inspired brain-to-sentence decoder for logosyllabic language

Logographic Information Aids Learning Better Representations for Natural Language Inference

Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation

Phonology-Augmented Statistical Framework for Machine Transliteration Using Limited Linguistic Resources

Neural Language Codes for Multilingual Acoustic Models

MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model

A brain-to-text framework for decoding natural tonal sentences

Multimodal Input Aids a Bayesian Model of Phonetic Learning

Cross-Modal Language Modeling in Multi-Motion-Informed Context for Lip Reading

Impacts of multicollinearity on CAPT modalities: An heterogeneous machine learning framework for computer-assisted French phoneme pronunciation training

English-to-Chinese Transliteration with Phonetic Back-transliteration

Improving Accented Mandarin Speech Recognition by Using Recurrent Neural Network Based Language Model Adaptation