Abstract: The ability to recognize abstract features of voice during auditory perception is a complex, yet poorly understood, feat of human audition. For the listener, this occurs in near-automatic fashion to seamlessly extract complex cues from a highly variable auditory signal. Voice perception depends on specialized regions of auditory cortex, including superior temporal gyrus (STG) and superior temporal sulcus (STS). However, the nature of voice encoding at the cortical level remains poorly understood. We leverage intracerebral recordings across human auditory cortex during presentation of voice and non-voice acoustic stimuli to examine voice encoding in auditory cortex in eight patient-participants undergoing epilepsy surgery evaluation. We show that voice selectivity increases along the auditory hierarchy from supratemporal plane (STP) to the STG and STS. Results show accurate decoding of vocalizations from human auditory cortical activity even in the complete absence of linguistic content. These findings show an early, less-selective temporal window of neural activity in the STG and STS followed by a sustained, strongly voice-selective window. We then developed encoding models that demonstrate divergence in the encoding of acoustic features along the auditory hierarchy, wherein STG/STS responses were best explained by voice category as opposed to the acoustic features of voice stimuli. This is in contrast to neural activity recorded from STP, in which responses were accounted for by acoustic features. These findings support a model of voice perception that engages categorical encoding mechanisms within STG and STS.

Significance Statement: Voice perception occurs via specialized networks in higher order auditory cortex, yet how voice features are encoded remains a central unanswered question. With human intracerebral recordings of auditory cortex, we provide evidence for categorical encoding of voice in STG and STS that occurs in the absence of linguistic content. This selectivity strengthens after an initial onset response and cannot be explained by simple acoustic features. Together, these data support the existence of sites within STG and STS that are specialized for voice perception.

Reconstructing Voice Identity from Noninvasive Auditory Cortex Recordings

Voice identity invariance by anterior temporal lobe neurons

Voice Identity Recognition: Functional Division of the Right STS and Its Behavioral Relevance

Categorical encoding of voice in human superior temporal cortex

Neural responses in human superior temporal cortex support coding of voice representations

Beyond Voice Identity Conversion: Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations

Hierarchical cortical networks of "voice patches" for processing voices in human brain.

Faces and voices in the brain: A modality-general person-identity representation in superior temporal sulcus

Functional Heterogeneity of Voice-Encoding Cortex Revealed by Clinical Language Mapping

Memorization-Based Training and Testing Paradigm for Robust Vocal Identity Recognition in Expressive Speech Using Event-Related Potentials Analysis

Reconstructing Speech from Real-Time Articulatory MRI Using Neural Vocoders

Towards Voice Reconstruction from EEG during Imagined Speech

Functional and causal neural mechanisms of human voice perception in noisy situations

Task-dependent decoding of speaker and vowel identity from auditory cortical response patterns

Toward a realistic model of speech processing in the brain with self-supervised learning

Voice of Your Brain: Cognitive Representations of Imagined Speech, Overt Speech, and Speech Perception Based on EEG

Neuro-Vision to Language: Enhancing Brain Recording-based Visual Reconstruction and Language Interaction

Towards Automatic Speech Identification from Vocal Tract Shape Dynamics in Real-time MRI

Reconstructing faces from fMRI patterns using deep generative neural networks

Anatomo-functional correspondence in the voice-selective regions of human prefrontal cortex

Decoding Vocal Articulations from Acoustic Latent Representations