Personalized Neural Speech Codec

Inseon Jang, Haici Yang, Wootaek Lim, Seungkwon Beack, Minje Kim
DOI: https://doi.org/10.1109/ICASSP48485.2024.10446067
2024-04-01
Abstract: In this paper, we propose a personalized neural speech codec, envisioning that personalization can reduce the model complexity or improve perceptual speech quality. Despite the common usage of speech codecs where only a single talker is involved on each side of the communication, personalizing a codec for the specific user has rarely been explored in the literature. First, we assume speakers can be grouped into smaller subsets based on their perceptual similarity. Then, we also postulate that a group-specific codec can focus on the group's speech characteristics to improve its perceptual quality and computational efficiency. To this end, we first develop a Siamese network that learns the speaker embeddings from the LibriSpeech dataset, which are then grouped into underlying speaker clusters. Finally, we retrain the LPCNet-based speech codec baselines on each of the speaker clusters. Subjective listening tests show that the proposed personalization scheme introduces model compression while maintaining speech quality. In other words, with the same model complexity, personalized codecs produce better speech quality.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The main problem this paper addresses is how to reduce model complexity and improve perceived speech quality through a Personalized Neural Speech Codec (PNSC). Specifically, the authors propose a personalization-based method that improves the performance of existing neural speech coding techniques for specific users or speakers, especially at low bitrates and with limited computational resources.

### Core issues of the paper

1. **Trade-off between model complexity and coding gain**: Although current Neural Speech Coding (NSC) techniques achieve significant coding gains at low bitrates, they usually come with high model complexity and computational cost. For example, autoregressive models such as WaveNet can generate high-quality speech waveforms, but their inference complexity is as high as 100 GFLOPS (giga floating-point operations per second), which makes them difficult to run on resource-constrained devices.
2. **Need for personalized speech codecs**: Although a speech codec typically serves only a single talker on each side of the communication, there is little research on personalizing codecs for specific users. The authors argue that a personalized codec can improve perceived quality and computational efficiency by focusing on the voice characteristics of specific speakers.

### Solutions

To address these challenges, the authors propose a personalized neural speech codec framework consisting of the following steps:

1. **Speaker embedding learning**: A Siamese network learns speaker embeddings from the LibriSpeech dataset. The embeddings are then grouped into several speaker clusters, so that the speakers within each cluster are perceptually similar.
2. **Personalized decoder training**: Starting from the LPCNet baseline model, a dedicated decoder is retrained for each speaker cluster. These personalized decoders can better adapt to the voice characteristics of their group, and thus provide better speech quality at the same model complexity.
3. **Subjective listening test**: The effectiveness of the personalization scheme is verified through subjective listening tests. The results show that the personalized codec maintains speech quality with a compressed model and, at the same complexity, produces better speech quality than the baseline.

### Formula representation

- **Loss function for speaker embedding learning**:
  \[ L_{\text{emb}} = -\sum_{i,j \sim S(k), \forall k} \log \sigma(z_i^\top z_j) - \sum_{i \sim S(k), j \sim S(k'), k \neq k'} \log (1 - \sigma(z_i^\top z_j)) \]
  where \(z_i\) and \(z_j\) are embedding vectors and \(\sigma(\cdot)\) is the sigmoid function. The first sum runs over pairs drawn from the same speaker \(k\), and the second over pairs drawn from different speakers \(k \neq k'\).
- **Cross-entropy loss function**:
  \[ L_{\text{CE}}(\hat{e}_i \| e_i) = -\sum_{i \in S(k), k \in H(c)} e_i \log \hat{e}_i \]
  where \(\hat{e}_i\) is the predicted excitation signal and \(e_i\) is the real excitation signal.

With this method, the paper demonstrates the potential of personalized codecs for model compression and perceived quality improvement, especially in resource-constrained environments.
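
As an illustration of the speaker-embedding loss above, the following is a minimal NumPy sketch, not the authors' implementation. The function names `log_sigmoid` and `embedding_loss` are hypothetical, and the sketch assumes that every unordered pair of distinct utterances in a batch is counted exactly once; the summary does not specify how pairs are sampled.

```python
import numpy as np


def log_sigmoid(x):
    # log(sigmoid(x)) computed stably as -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)


def embedding_loss(z, speaker_ids):
    """Pairwise embedding loss over a batch (illustrative sketch).

    z           : (n, d) array of embedding vectors z_i
    speaker_ids : length-n sequence with the speaker label of each row
    Assumption: each unordered pair (i, j) with i != j is counted once.
    """
    z = np.asarray(z, dtype=float)
    ids = np.asarray(speaker_ids)

    scores = z @ z.T                              # z_i^T z_j for every pair
    same = ids[:, None] == ids[None, :]           # True where speakers match
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)  # keep i < j only

    # Same-speaker pairs contribute log sigmoid(z_i^T z_j).
    pos = log_sigmoid(scores[same & upper]).sum()
    # Different-speaker pairs contribute log(1 - sigmoid(x)) = log sigmoid(-x).
    neg = log_sigmoid(-scores[~same & upper]).sum()
    return -(pos + neg)


if __name__ == "__main__":
    # Toy embeddings: rows 0-1 point one way, rows 2-3 the opposite way.
    z = np.array([[2.0, 0.0], [1.5, 0.5], [-2.0, 0.0], [-1.0, -1.0]])
    matching = embedding_loss(z, [0, 0, 1, 1])     # labels agree with geometry
    mismatched = embedding_loss(z, [0, 1, 0, 1])   # labels contradict geometry
    print(matching < mismatched)
```

The loss is small when embeddings of the same speaker have a large inner product and embeddings of different speakers have a strongly negative one, which is the property a subsequent clustering step can exploit.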