Abstract: To transform continuous speech into words, the human brain must resolve variability across utterances in intonation, speech rate, volume, accents and so on. A promising approach to explaining this process has been to model electroencephalogram (EEG) recordings of brain responses to speech. Contemporary models typically invoke context-invariant speech categories (e.g. phonemes) as an intermediary representational stage between sounds and words. However, such models may not capture the complete picture because they do not model the brain mechanism that categorizes sounds and consequently may overlook associated neural representations. By providing end-to-end accounts of speech-to-text transformation, new deep-learning systems could enable more complete brain models. We model EEG recordings of audiobook comprehension with the deep-learning speech recognition system Whisper. We find that (1) Whisper provides a self-contained EEG model of an intermediary representational stage that reflects elements of prelexical and lexical representation and prediction; (2) EEG modeling is more accurate when informed by 5-10 s of speech context, which traditional context-invariant categorical models do not encode; (3) deep Whisper layers encoding linguistic structure were more accurate EEG models of selectively attended speech in two-speaker "cocktail party" listening conditions than early layers encoding acoustics. No such layer-depth advantage was observed for unattended speech, consistent with a more superficial level of linguistic processing in the brain.

Most people can effortlessly understand different speakers with distinct voices and accents, no matter whether they whisper, shout or are happy or sad. This effortlessness belies the remarkable computational challenge that our brains solve to transform such variable speech sounds into words.
It is believed that our brains deal with this variability by categorizing speech sounds into a sequence of phoneme and/or syllable units that are consistent, no matter how a word is spoken. Compelling supporting evidence has come from electrophysiological recordings of brain activity (colloquially known as brain waves) taken as people listen to speech. Scientists have trained computational models to predict brain wave fluctuations that correlate with sequences of phoneme categories. However, modeling only phoneme categories may miss key stages in the recognition process, including how sounds are mapped to phonemes, and phonemes to words. New deep-learning speech models that are trained to recognize a diverse range of speech and speakers may offer new opportunities to provide more complete accounts of brain activity. This article reveals that these models indeed predict hitherto unexplained fluctuations in brain-wave responses to speech that reflect elements of sub-words and words, and shows that these fluctuations are context-sensitive, which may reflect the brain anticipating upcoming speech.
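The idea of using speech-model features as an "EEG model" can be illustrated with a small sketch. The abstract does not state which regression method was used, so the following rests on assumptions: it fits a time-lagged linear ridge regression, a common choice for encoding models, to synthetic data. The random `features` array merely stands in for activations from one layer of a speech model, the `eeg` array is simulated, and every function name and parameter value is hypothetical.

```python
import numpy as np

def lag_features(X, n_lags):
    """Stack copies of X delayed by 0..n_lags-1 samples (zero-padded at the start)."""
    n_samples, n_feat = X.shape
    lagged = np.zeros((n_samples, n_feat * n_lags))
    for lag in range(n_lags):
        lagged[lag:, lag * n_feat:(lag + 1) * n_feat] = X[:n_samples - lag]
    return lagged

def fit_ridge(X, Y, alpha):
    """Closed-form ridge regression weights mapping X to Y."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_feat), X.T @ Y)

def channel_correlations(Y_true, Y_pred):
    """Pearson correlation between measured and predicted signal, per channel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom

# Synthetic stand-ins (assumptions, not real data).
rng = np.random.default_rng(0)
n_samples, n_feat, n_chan, n_lags = 2000, 8, 4, 5
features = rng.standard_normal((n_samples, n_feat))  # placeholder for layer activations
X = lag_features(features, n_lags)
true_w = rng.standard_normal((n_feat * n_lags, n_chan))
eeg = X @ true_w + rng.standard_normal((n_samples, n_chan))  # simulated "EEG"

# Fit on the first part of the recording, evaluate on held-out samples.
split = 1500
W = fit_ridge(X[:split], eeg[:split], alpha=1.0)
r = channel_correlations(eeg[split:], X[split:] @ W)
print("held-out correlation per channel:", np.round(r, 3))
```

Under this kind of scheme, comparing layers or context lengths would amount to swapping the `features` array and comparing held-out correlations, although the actual analysis pipeline may differ.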

A predictive learning model can simulate temporal dynamics and context effects found in neural representations of continuous speech

Adaptive Temporal Encoding Leads to a Background-Insensitive Cortical Representation of Speech

Temporal coding of speech in human auditory cortex

Robust cortical encoding of slow temporal modulations of speech.

[The Neural Encoding of Continuous Speech - Recent Advances in EEG and MEG Studies].

Deep-learning models reveal how context and listener attention shape electrophysiological correlates of speech-to-language transformation

Meta predictive learning model of languages in neural circuits

Statistical Learning in Speech: A Biologically Based Predictive Learning Model

Predictive learning shapes the representational geometry of the human brain

Time-warp-invariant neuronal processing

The neural dynamics of auditory word recognition and integration

Dynamic modeling of EEG responses to natural speech reveals earlier processing of predictable words

Appropriate data segmentation improves speech encoding models

Natural language processing models reveal neural dynamics of human conversation

Developmental Predictive Coding Model for Early Infancy Mono and Bilingual Vocal Continual Learning

Recurrent neural networks as neuro-computational models of human speech recognition

Predictive Encoding of Contextual Relationships for Perceptual Inference, Interpolation and Prediction.

Architecture of a Cortex Inspired Hierarchical Event Recaller

Shared computational principles for language processing in humans and deep language models

Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? -- A computational investigation