Abstract: To transform continuous speech into words, the human brain must resolve variability across utterances in intonation, speech rate, volume, accents and so on. A promising approach to explaining this process has been to model electroencephalogram (EEG) recordings of brain responses to speech. Contemporary models typically invoke context-invariant speech categories (e.g. phonemes) as an intermediary representational stage between sounds and words. However, such models may not capture the complete picture because they do not model the brain mechanism that categorizes sounds and consequently may overlook associated neural representations. By providing end-to-end accounts of speech-to-text transformation, new deep-learning systems could enable more complete brain models. We model EEG recordings of audiobook comprehension with the deep-learning speech recognition system Whisper. We find that (1) Whisper provides a self-contained EEG model of an intermediary representational stage that reflects elements of prelexical and lexical representation and prediction; (2) EEG modeling is more accurate when informed by 5-10 s of speech context, which traditional context-invariant categorical models do not encode; (3) deep Whisper layers encoding linguistic structure were more accurate EEG models of selectively attended speech in two-speaker "cocktail party" listening conditions than early layers encoding acoustics. No such layer depth advantage was observed for unattended speech, consistent with a more superficial level of linguistic processing in the brain.

Most people can effortlessly understand different speakers with distinct voices and accents, no matter whether they whisper, shout or are happy or sad. This effortlessness belies the remarkable computational challenge that our brains solve to transform such variable speech sounds into words. It is believed that our brains deal with this variability by categorizing speech sounds into a flow of phoneme and/or syllable units that are consistent, no matter how a word is spoken. Compelling supporting evidence has come from electrophysiological recordings of brain activity, colloquially known as brain waves, taken as people listen to speech. Scientists have trained computational models to predict brain wave fluctuations that correlate with sequences of phoneme categories. However, modeling only phoneme categories may miss key stages in the recognition process, including how sounds are mapped to phonemes, and phonemes to words. New deep-learning speech models that are trained to recognize a diverse range of speech and speakers may offer new opportunities to provide more complete accounts of brain activity. This article reveals that these models indeed predict hitherto unexplained fluctuations in speech brain waves that reflect elements of sub-words and words, and shows that these fluctuations are context-sensitive, which may reflect the brain anticipating upcoming speech.
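The idea of training a model to predict brain wave fluctuations from speech features can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes a linear ridge regression over time-lagged features, uses random numbers as a stand-in for activations from a speech model layer, and simulates a single "EEG" channel. The helper names (`lag_features`, `fit_ridge`, `pearson`), the number of lags and the ridge penalty are all hypothetical choices made for illustration.

```python
import numpy as np

def lag_features(x, n_lags):
    """Stack the current and the previous n_lags - 1 time steps of x (time, dims)."""
    n_t, n_d = x.shape
    out = np.zeros((n_t, n_d * n_lags))
    for k in range(n_lags):
        # Column block k holds the features delayed by k samples (zero-padded at the start).
        out[k:, k * n_d:(k + 1) * n_d] = x[:n_t - k]
    return out

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression weights: (X'X + alpha*I)^-1 X'y."""
    n_f = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_f), X.T @ y)

def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

rng = np.random.default_rng(0)
n_t, n_d, n_lags = 2000, 8, 5

# Assumption: random features stand in for speech-model layer activations.
features = rng.standard_normal((n_t, n_d))
X = lag_features(features, n_lags)

# Assumption: a synthetic "EEG" channel that is a noisy linear mix of lagged features.
true_w = rng.standard_normal(X.shape[1])
eeg = X @ true_w + rng.standard_normal(n_t)

# Fit on the first half, then score prediction accuracy on the held-out second half.
split = n_t // 2
w = fit_ridge(X[:split], eeg[:split], alpha=1.0)
r = pearson(X[split:] @ w, eeg[split:])
print(f"held-out prediction correlation: {r:.3f}")
```

In this framing, comparing feature sets (for example, early versus deep layers, or shorter versus longer context) amounts to swapping what goes into `features` and comparing the held-out correlation.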

A model of infant speech perception and learning

Modeling early phonetic acquisition from child-centered audio data

Evaluating computational models of infant phonetic learning across languages

A computational model of early language acquisition from audiovisual experiences of young infants

InfantNet: A Deep Neural Network for Analyzing Infant Vocalizations

Statistical Learning in Speech: A Biologically Based Predictive Learning Model

A model of early word acquisition based on realistic-scale audiovisual naming events

Learning to Produce Syllabic Speech Sounds via Reward-Modulated Neural Plasticity

An open-source voice type classifier for child-centered daylong recordings

Perception Point: Identifying Critical Learning Periods in Speech for Bilingual Networks

Learning to imitate facial expressions through sound

A memory-driven auditory program ensures selective and precise vocal imitation in zebra finches

Deep-learning models reveal how context and listener attention shape electrophysiological correlates of speech-to-language transformation

Examining speech-brain tracking during early bidirectional, free-flowing caregiver-infant interactions

Decoding speech information from EEG data with 4, 7 and 11 month-old infants: Contrasting convolutional neural network, mutual information-based and backward linear models

Learning the sound inventory of a complex vocal skill via an intrinsic reward

Exploring the effectiveness of reward-based learning strategies for second-language speech sounds

Decoding speech information from EEG data with 4-, 7- and 11-month-old infants: Using convolutional neural network, mutual information-based and backward linear models

Assessing language acquisition from parent-child interaction: An event-related potential study on perception of audio-visual cues in infancy

A developmental model of audio-visual attention (MAVA) for bimodal language learning in infants and robots

The formation of perceptual space in early phonetic acquisition: a cross-linguistic modeling approach