Vision Perceptually Restores Auditory Spectral Dynamics in Speech

John Plass, David Brang, Satoru Suzuki, Marcia Grabowecky
DOI: https://doi.org/10.31234/osf.io/t954p
2019-05-20
Abstract: Visual speech facilitates auditory speech perception, but the visual cues responsible for these effects and the crossmodal information they provide remain unclear. Because visible articulators shape the spectral content of auditory speech, we hypothesized that listeners may be able to extract spectrotemporal information from visual speech to facilitate auditory speech perception. To uncover statistical regularities that could subserve such facilitations, we compared the resonant frequency of the oral cavity to the shape of the oral aperture during speech. We found that the time-frequency dynamics of oral resonances could be recovered with unexpectedly high precision from the shape of the mouth during speech. Because both auditory frequency modulations and visual shape properties are neurally encoded as mid-level perceptual features, we hypothesized that this feature-level correspondence would allow for spectrotemporal information to be recovered from visual speech without reference to higher-order (e.g., phonemic) speech representations. Isolating these features from other speech cues, we found that speech-based shape deformations improved sensitivity for corresponding frequency modulations, suggesting that the perceptual system exploits crossmodal correlations in mid-level feature representations to enhance speech perception. To test whether this correspondence could be used to improve comprehension, we selectively degraded the spectral or temporal dimensions of auditory sentence spectrograms to assess how well visual speech facilitated comprehension under each degradation condition. Visual speech produced drastically larger enhancements during spectral degradation, suggesting a condition-specific facilitation effect driven by crossmodal recovery of auditory speech spectra. Visual speech may therefore facilitate perception by crossmodally restoring degraded spectrotemporal signals in speech.