Deep-learning models reveal how context and listener attention shape electrophysiological correlates of speech-to-language transformation

Andrew J. Anderson,Chris Davis,Edmund C. Lalor
DOI: https://doi.org/10.1371/journal.pcbi.1012537
2024-11-13
PLoS Computational Biology
Abstract:To transform continuous speech into words, the human brain must resolve variability across utterances in intonation, speech rate, volume, accents and so on. A promising approach to explaining this process has been to model electroencephalogram (EEG) recordings of brain responses to speech. Contemporary models typically invoke context invariant speech categories (e.g. phonemes) as an intermediary representational stage between sounds and words. However, such models may not capture the complete picture because they do not model the brain mechanism that categorizes sounds and consequently may overlook associated neural representations. By providing end-to-end accounts of speech-to-text transformation, new deep-learning systems could enable more complete brain models. We model EEG recordings of audiobook comprehension with the deep-learning speech recognition system Whisper. We find that (1) Whisper provides a self-contained EEG model of an intermediary representational stage that reflects elements of prelexical and lexical representation and prediction; (2) EEG modeling is more accurate when informed by 5-10s of speech context, which traditional context invariant categorical models do not encode; (3) Deep Whisper layers encoding linguistic structure were more accurate EEG models of selectively attended speech in two-speaker "cocktail party" listening conditions than early layers encoding acoustics. No such layer depth advantage was observed for unattended speech, consistent with a more superficial level of linguistic processing in the brain. Most people effortlessly can understand different speakers with distinct voices and accents, no matter whether they whisper, shout or are happy or sad. This effortlessness belies the remarkable computational challenge that our brains solve to transform such variable speech sounds into words. It is believed that our brains deal with this variability by categorizing speech sounds into a flow of phonemes and/or syllable units that are consistent, no matter how a word is spoken. Compelling supporting evidence has come from electrophysiological recordings of brain activity–colloquially known as brain waves—taken as people listen to speech. Scientists have trained computational models to predict brain wave fluctuations that correlate with sequences of phoneme categories. However, modeling only phoneme categories may miss key stages in the recognition process, including how sounds are mapped to phonemes, and phonemes to words. New deep learning speech models that are trained to recognize a diverse range of speech and speakers may offer new opportunities to provide more complete accounts of brain activity. This article reveals that these models indeed predict hitherto unexplained fluctuations in speech brain waves that reflect elements of sub-words and words, and shows that fluctuations are context sensitive, which may reflect the brain anticipating upcoming speech.
biochemical research methods,mathematical & computational biology
What problem does this paper attempt to address?