Abstract: Most current very low bit rate (VLBR) speech coding systems use hidden Markov model (HMM) based speech recognition and synthesis techniques. This allows information such as phonemes to be transmitted segment by segment, which decreases the bit rate. However, an encoder based on phoneme speech recognition may create bursts of segmental errors, which would be further propagated to any suprasegmental (such as syllable) information coding. Together with the errors of voicing detection in pitch parametrization, HMM-based speech coding leads to speech discontinuities and unnatural speech sound artifacts. In this paper, we propose a novel VLBR speech coding framework based on neural networks (NNs) for end-to-end speech analysis and synthesis without HMMs. The speech coding framework relies on a phonological (subphonetic) representation of speech. It is designed as a composition of deep and spiking NNs: a bank of phonological analyzers at the transmitter and a phonological synthesizer at the receiver, both realized as deep NNs, along with a spiking NN as an incremental and robust encoder of syllable boundaries for coding of continuous fundamental frequency (F0). A combination of phonological features defines many more sound patterns than the phonetic features defined by HMM-based speech coders; this finer analysis/synthesis code contributes to smoother encoded speech. Listeners significantly prefer the NN-based approach due to fewer discontinuities and speech artifacts in the encoded speech. A single forward pass is required during speech encoding and decoding. The proposed VLBR speech coding operates at a bit rate of approximately 360 bits/s.
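
The claim that a combination of phonological features defines many more sound patterns than a phoneme inventory can be illustrated with a small counting sketch. The feature count and inventory size below are hypothetical values chosen for illustration, not figures from the paper, and the features are assumed to be binary and independently switchable.

```python
import math

# Hypothetical values for illustration only; neither is taken from the paper.
NUM_PHONOLOGICAL_FEATURES = 12   # assumed number of binary phonological features
PHONEME_INVENTORY_SIZE = 40      # assumed size of a phoneme inventory


def feature_pattern_count(num_binary_features: int) -> int:
    """Number of distinct on/off combinations of binary features."""
    return 2 ** num_binary_features


def bits_for_inventory(inventory_size: int) -> int:
    """Bits needed to index one symbol from a fixed inventory."""
    return math.ceil(math.log2(inventory_size))


patterns = feature_pattern_count(NUM_PHONOLOGICAL_FEATURES)
phoneme_bits = bits_for_inventory(PHONEME_INVENTORY_SIZE)

print(f"{NUM_PHONOLOGICAL_FEATURES} binary features -> {patterns} possible patterns")
print(f"{PHONEME_INVENTORY_SIZE} phonemes -> {phoneme_bits} bits per phoneme index")
```

Under these assumptions the feature code can distinguish far more patterns than the phoneme inventory has symbols, although not every combination necessarily corresponds to a realizable speech sound.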

Practical cognitive speech compression

Ultra-Low-Bitrate Speech Coding with Pretrained Transformers

Composition of Deep and Spiking Neural Networks for Very Low Bit Rate Speech Coding

Optimizing Neural Speech Codec for Low-Bitrate Compression via Multi-Scale Encoding

PSCodec: A Series of High-Fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders

SuperCodec: A Neural Speech Codec with Selective Back-Projection Network

CQNV: A combination of coarsely quantized bitstream and neural vocoder for low rate speech coding

Neural Speech Coding for Real-time Communications using Constant Bitrate Scalar Quantization

IBACodec: End-to-end speech codec with intra-inter broad attention

An Intra-BRNN and GB-RVQ Based END-TO-END Neural Audio Codec

Low Bit-Rate Speech Coding with VQ-VAE and a WaveNet Decoder

SoundStream: An End-to-End Neural Audio Codec

Variational Speech Waveform Compression to Catalyze Semantic Communications

Variable-rate Neural Speech Compression with Multi-scale Feature Extraction and Improved Entropy Modeling

A High Fidelity and Low Complexity Neural Audio Coding

A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate

Analysis by Adversarial Synthesis -- A Novel Approach for Speech Vocoding

Ultra-lightweight Neural Differential DSP Vocoder For High Quality Speech Synthesis

Advancing The Rate-Distortion-Computation Frontier For Neural Image Compression

Srcodec: Split-Residual Vector Quantization for Neural Speech Codec

AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec