Abstract: In this paper, we propose a personalized neural speech codec, envisioning that personalization can reduce the model complexity or improve perceptual speech quality. Despite the common usage of speech codecs where only a single talker is involved on each side of the communication, personalizing a codec for the specific user has rarely been explored in the literature. First, we assume speakers can be grouped into smaller subsets based on their perceptual similarity. Then, we also postulate that a group-specific codec can focus on the group's speech characteristics to improve its perceptual quality and computational efficiency. To this end, we first develop a Siamese network that learns the speaker embeddings from the LibriSpeech dataset, which are then grouped into underlying speaker clusters. Finally, we retrain the LPCNet-based speech codec baselines on each of the speaker clusters. Subjective listening tests show that the proposed personalization scheme introduces model compression while maintaining speech quality. In other words, with the same model complexity, personalized codecs produce better speech quality.

What problem does this paper attempt to address?

The paper aims to reduce model complexity and improve perceived speech quality with a Personalized Neural Speech Codec (PNSC). Specifically, the authors propose a personalization-based method for improving existing neural speech coding techniques on specific users or speakers, especially at low bitrates and with limited computational resources.

### Core issues of the paper

1. **Trade-off between model complexity and coding gain**: Current Neural Speech Coding (NSC) techniques achieve significant coding gains at low bitrates, but usually at the price of high model complexity and computational cost. For example, autoregressive models such as WaveNet can generate high-quality speech waveforms, but their inference complexity is as high as 100 GFLOPS (giga floating-point operations per second), which makes them difficult to run on resource-constrained devices.
2. **Need for personalized speech codecs**: Although a speech codec typically serves only a single talker on each side of the communication, personalizing a codec for a specific user has rarely been studied. The authors argue that a personalized codec can improve perceived quality and computational efficiency by focusing on the voice characteristics of specific speakers.

### Solutions

To address these challenges, the authors propose a personalized neural speech codec framework with the following steps:

1. **Speaker embedding learning**: A Siamese network learns speaker embeddings from the LibriSpeech dataset. The embeddings are then grouped into several speaker clusters, so that the speakers within each cluster are perceptually similar.
2. **Personalized decoder training**: Starting from the LPCNet baseline model, a dedicated decoder is retrained for each speaker cluster. These personalized decoders adapt better to the voice characteristics of their group and therefore provide better speech quality at the same model complexity.
3. **Subjective listening test**: The personalization scheme is verified through subjective listening tests. The results show that the personalized codec maintains speech quality while allowing model compression, and produces better speech quality at the same complexity.

### Formula representation

- **Loss function for speaker embedding learning**:

  \[ L_{\text{emb}} = -\sum_{i,j \sim S(k), \forall k} \log \sigma(z_i^\top z_j)-\sum_{i \sim S(k), j \sim S(k'), k \neq k'} \log (1 - \sigma(z_i^\top z_j)) \]

  where \(z_i\) and \(z_j\) are embedding vectors and \(\sigma(\cdot)\) is the sigmoid function. Judging from the index sets, \(S(k)\) denotes the samples of speaker \(k\): the first sum runs over pairs from the same speaker and pulls their embeddings together, while the second sum runs over pairs from different speakers \(k \neq k'\) and pushes them apart.

- **Cross-entropy loss function**:

  \[ L_{\text{CE}}(\hat{e}_i \| e_i)=-\sum_{i \in S(k), k \in H(c)} e_i \log \hat{e}_i \]

  where \(\hat{e}_i\) is the predicted excitation signal and \(e_i\) is the ground-truth excitation signal. The sum is restricted to samples of speakers \(k\) in \(H(c)\), which from context is the set of speakers assigned to cluster \(c\), so each cluster-specific model is trained only on its own group.

In this way, the paper demonstrates the potential of personalized codecs for model compression and perceived quality improvement, especially in resource-constrained environments.
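As a rough illustration of the two losses in the formula section, the following NumPy sketch evaluates them on small arrays. It is not the authors' implementation: the function names `embedding_loss` and `excitation_cross_entropy`, the `eps` stabilizer, the choice to count each unordered pair once, and the treatment of the excitation as a categorical distribution with integer class targets are all assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def embedding_loss(z, speaker_ids, eps=1e-12):
    """Pairwise loss L_emb over embeddings z (N, D) with speaker labels (N,).

    Assumption: each unordered pair (i, j), i < j, is counted once.
    """
    z = np.asarray(z, dtype=float)
    ids = np.asarray(speaker_ids)
    n = len(ids)
    scores = sigmoid(z @ z.T)                          # sigma(z_i^T z_j) for all pairs
    same = ids[:, None] == ids[None, :]                # True where i, j share a speaker
    upper = np.triu(np.ones((n, n), dtype=bool), k=1)  # keep only pairs with i < j
    pos = same & upper                                 # same-speaker pairs
    neg = ~same & upper                                # different-speaker pairs
    loss_pos = -np.log(scores[pos] + eps).sum()
    loss_neg = -np.log(1.0 - scores[neg] + eps).sum()
    return float(loss_pos + loss_neg)


def excitation_cross_entropy(probs, targets, eps=1e-12):
    """Cross-entropy L_CE for predicted distributions probs (T, C).

    Assumption: targets (T,) are integer class indices, i.e. e_i is one-hot.
    """
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets)
    picked = probs[np.arange(len(targets)), targets]   # probability of the true class
    return float(-np.log(picked + eps).sum())


# Toy usage: two tight, well-separated groups of embeddings.
z = np.array([[3.0, 0.0], [3.0, 0.0], [-3.0, 0.0], [-3.0, 0.0]])
good = embedding_loss(z, [0, 0, 1, 1])   # labels agree with the geometry
bad = embedding_loss(z, [0, 1, 0, 1])    # labels contradict the geometry
print(good < bad)
```

In this toy case the loss is small when the speaker labels match the geometry of the embeddings and large when they contradict it, which is the behavior the first formula is designed to reward. In a per-cluster setup, the cross-entropy would simply be evaluated on the samples of the speakers assigned to that cluster.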
