Abstract: During face-to-face conversational speech, listeners must efficiently process a rapid and complex stream of multisensory information. Visual speech can serve as a critical complement to auditory information because it provides cues to both the timing of the incoming acoustic signal (the amplitude envelope, influencing attention and perceptual sensitivity) and its content (place and manner of articulation, constraining lexical selection). Here we review behavioral and neurophysiological evidence regarding listeners' use of visual speech information. Multisensory integration of audiovisual speech cues improves recognition accuracy, particularly for speech in noise. Even when speech is intelligible based solely on auditory information, adding visual information may reduce the cognitive demands placed on listeners by increasing the precision of prediction. Electrophysiological studies demonstrate that oscillatory cortical entrainment to speech in auditory cortex is enhanced when visual speech is present, increasing sensitivity to important acoustic cues. Neuroimaging studies also suggest increased activity in auditory cortex when congruent visual information is available, but additionally emphasize the involvement of heteromodal regions of posterior superior temporal sulcus in integrative processing. We interpret these findings in a framework of temporally-focused lexical competition with two stages: an early integration mechanism, in which visual speech information affects auditory processing to increase sensitivity to acoustic information, and a late integration stage that incorporates specific information about a speaker's articulators to constrain the number of possible candidates in a spoken utterance. Ultimately, it is words compatible with both auditory and visual information that most strongly determine successful speech perception during everyday listening. Thus, audiovisual speech perception is accomplished through multiple stages of integration, supported by distinct neuroanatomical mechanisms.

On the Role of Visual Cues in Audiovisual Speech Enhancement

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Visual Speech Enhancement

An Attention Based Speaker-Independent Audio-Visual Deep Learning Model for Speech Enhancement

An Empirical Study of Visual Features for DNN based Audio-Visual Speech Enhancement in Multi-talker Environments

SAV-SE: Scene-aware Audio-Visual Speech Enhancement with Selective State Space Model

Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis

How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition

Audio-Visual Speech Enhancement in Noisy Environments via Emotion-Based Contextual Cues

Uncovering the Visual Contribution in Audio-Visual Speech Recognition

Rethinking the visual cues in audio-visual speaker extraction

Vision Perceptually Restores Auditory Spectral Dynamics in Speech

Audio-Visual Speech Enhancement Based on Multiscale Features and Parallel Attention

Visual Hallucination Elevates Speech Recognition

Prediction and constraint in audiovisual speech perception

Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues

Incorporating Visual Information Reconstruction into Progressive Learning for Optimizing audio-visual Speech Enhancement

Leveraging Topics and Audio Features with Multimodal Attention for Audio Visual Scene-Aware Dialog

Audio-Visual Speech Enhancement Using Self-supervised Learning to Improve Speech Intelligibility in Cochlear Implant Simulations

Audiovisual Highlight Detection in Videos

From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation