Abstract:

Objectives: Watching a talker's mouth is beneficial for speech reception (SR) in many communication settings, especially in noise and when hearing is impaired. Measures of audiovisual (AV) SR can be valuable in diagnosing or treating hearing disorders. This study addresses the lack of standardized methods in many languages for assessing lipreading, AV gain, and integration. A new method is validated that supplements a German speech audiometric test with visualizations of the synthetic articulation of an avatar, which was chosen because it allows auditory speech to be lip-synced in a highly standardized way. Three hypotheses were formed according to the literature on AV SR that used live or filmed talkers, and it was tested whether the respective effects could be reproduced with synthetic articulation: (1) cochlear implant (CI) users have a higher visual-only SR than normal-hearing (NH) individuals, and younger individuals obtain higher lipreading scores than older persons. (2) Both CI users and NH individuals benefit from AV over unimodal (auditory or visual) presentation of sentences in noise. (3) Both CI and NH listeners efficiently integrate complementary auditory and visual speech features.

Design: In a controlled, cross-sectional study with 14 experienced CI users (mean age 47.4 years) and 14 NH individuals (mean age 46.3 years, similar broad age distribution), lipreading, AV gain, and integration were assessed with a German matrix sentence test. Visual speech stimuli were synthesized by the articulation of the Talking Head system "MASSY" (Modular Audiovisual Speech Synthesizer), which displayed standardized articulation with respect to the visibility of German phones.

Results: In line with the hypotheses and previous literature, CI users had a higher mean visual-only SR than NH individuals (CI, 38%; NH, 12%; p < 0.001). Age was correlated with lipreading such that within each group, younger individuals obtained higher visual-only scores than older persons (rCI = −0.54; p = 0.046; rNH = −0.78; p < 0.001). Both CI and NH groups benefitted from AV over unimodal speech, as indexed by the measures visual enhancement and auditory enhancement (each p < 0.001). Both groups efficiently integrated complementary auditory and visual speech features, as indexed by the measure integration enhancement (each p < 0.005).

Conclusions: Given the good agreement between results from the literature and the outcome of supplementing an existing validated auditory test with synthetic visual cues, the introduced method can be considered a promising candidate for clinical and scientific applications that assess measures important for AV SR in a standardized manner. This could be beneficial for optimizing the diagnosis of individual listening and communication disorders and their treatment, such as cochlear implantation.
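
The abstract names three derived measures (visual enhancement, auditory enhancement, integration enhancement) without giving their formulas. As a rough sketch only, assuming the normalized definitions commonly used in the AV speech literature, the calculation might look like the following: enhancement is the AV gain over a unimodal score divided by the room left for improvement, and integration enhancement compares the observed AV score with a probabilistic prediction 1 − (1 − A)(1 − V). The function names and the example scores are hypothetical, and the study's exact formulas may differ.

```python
# Sketch of commonly used AV speech-reception measures.
# ASSUMPTION: these normalized definitions are taken from the general AV
# literature, not from the abstract itself. All scores are proportions
# correct in the range [0, 1).


def _check(score: float) -> None:
    """Reject scores outside [0, 1); a baseline of 1.0 leaves no room for gain."""
    if not 0.0 <= score < 1.0:
        raise ValueError(f"baseline score must be in [0, 1), got {score}")


def visual_enhancement(a: float, av: float) -> float:
    """Gain from adding vision to audition, relative to the room above A."""
    _check(a)
    return (av - a) / (1.0 - a)


def auditory_enhancement(v: float, av: float) -> float:
    """Gain from adding audition to vision, relative to the room above V."""
    _check(v)
    return (av - v) / (1.0 - v)


def predicted_av(a: float, v: float) -> float:
    """AV score expected if auditory and visual cues were combined independently."""
    return 1.0 - (1.0 - a) * (1.0 - v)


def integration_enhancement(a: float, v: float, av: float) -> float:
    """Observed AV score beyond the independent-cue prediction, normalized."""
    p = predicted_av(a, v)
    _check(p)
    return (av - p) / (1.0 - p)


if __name__ == "__main__":
    # Hypothetical scores for one listener (not data from the study).
    a, v, av = 0.40, 0.20, 0.70
    print("VE:", round(visual_enhancement(a, av), 3))
    print("AE:", round(auditory_enhancement(v, av), 3))
    print("IE:", round(integration_enhancement(a, v, av), 3))
```

Under these assumed definitions, a positive integration enhancement means the listener scored better with AV speech than independent use of the two modalities would predict.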

Development of speechreading supplements based on automatic speech recognition

Speech neuromuscular decoding based on spectrogram images using conformal predictors with Bi-LSTM

Silent Speech Decoding Using Spectrogram Features Based on Neuromuscular Activities

Automatic Detection of the Temporal Segmentation of Hand Movements in British English Cued Speech

Seeing speech: neural mechanisms of cued speech perception in prelingually deaf and hearing users

SilentTalk: Lip Reading Through Ultrasonic Sensing on Mobile Phones

Investigating the dynamics of hand and lips in French Cued Speech using attention mechanisms and CTC-based decoding

A Strategic Approach for Robust Dysarthric Speech Recognition

A Novel Interpretable and Generalizable Re-synchronization Model for Cued Speech based on a Multi-Cuer Corpus

Re-Synchronization Using the Hand Preceding Model for Multi-Modal Fusion in Automatic Continuous Cued Speech Recognition

Automatic Speech Recognition and its Visual Perception Via a Cymatics Based Display

Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition

A New Re-synchronization Method Based Multi-modal Fusion for Automatic Continuous Cued Speech Recognition

Multistream neural architectures for cued-speech recognition using a pre-trained visual feature extractor and constrained CTC decoding

Bridge to Non-Barrier Communication: Gloss-Prompted Fine-grained Cued Speech Gesture Generation with Diffusion Model

Validating a Method to Assess Lipreading, Audiovisual Gain, and Integration During Speech Reception With Cochlear-Implanted and Normal-Hearing Subjects Using a Talking Head

Speech decoding using cortical and subcortical electrophysiological signals

Community-Supported Shared Infrastructure in Support of Speech Accessibility

Reading Miscue Detection in Primary School through Automatic Speech Recognition

Decoding Silent Speech Commands from Articulatory Movements Through Soft Magnetic Skin and Machine Learning

Exploring Retraining-Free Speech Recognition for Intra-sentential Code-Switching