Natural speech re-synthesis from direct cortical recordings using a pre-trained encoder-decoder framework

Jiawei Li, Chunxu Guo, Edward F Chang, Yuanning Li
DOI: https://doi.org/10.1101/2024.12.16.628596
2024-12-17
Abstract: Reconstructing perceived speech stimuli from neural recordings not only advances the understanding of the neural coding underlying speech processing but is also an important building block for brain-computer interfaces and neuroprosthetics. However, previous attempts to directly re-synthesize speech from neural decoding suffer from low re-synthesis quality. With limited neural data and a complex speech representation space, it is hard to build decoding models that directly map neural signals into high-fidelity speech. In this work, we propose a pre-trained encoder-decoder framework to address these problems. We recorded high-density electrocorticography (ECoG) signals while participants listened to natural speech. We built a pre-trained speech re-synthesis network that consists of a context-dependent speech encoding network and a generative adversarial network (GAN) for high-fidelity speech synthesis. This model was pre-trained on a large naturalistic speech corpus and can extract critical features for speech re-synthesis. We then built a lightweight neural decoding network that maps the ECoG signal into the latent space of the pre-trained network, and used the GAN decoder to synthesize natural speech. Using only 20 minutes of intracranial neural data, our neural-driven speech re-synthesis model demonstrated promising performance, with a phoneme error rate (PER) of 28.6%, and human listeners were able to recognize 71.6% of the words in the re-synthesized speech. This work demonstrates the feasibility of using pre-trained self-supervised models and feature alignment to build efficient neural-to-speech decoding models.
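The abstract reports a phoneme error rate (PER) but does not say how it was computed. As a rough illustration only, the sketch below uses the conventional definition: the Levenshtein (edit) distance between the reference and recognized phoneme sequences, divided by the reference length. The function names and the example phoneme sequences are hypothetical and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion cost 1)."""
    # prev[j] holds the distance between the first i-1 items of ref and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis is i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """PER under the assumed conventional definition: edit distance / reference length."""
    if not ref:
        raise ValueError("reference phoneme sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one phoneme ("L") is missing from the recognized sequence.
reference = ["HH", "AH", "L", "OW"]
recognized = ["HH", "AH", "OW"]
per = phoneme_error_rate(reference, recognized)  # 1 edit over 4 reference phonemes
```

Under this definition PER can exceed 1.0 when the hypothesis contains many insertions, which is why it is called an error rate, not an accuracy complement.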
Biology