Generative Spoken Language Modeling with Quantized Feature Enhancement

Feiyu Duan, Chen Li, Keheng Wang, Si Wu, Chuantao Yin, Wenge Rong
DOI: https://doi.org/10.1109/ijcnn60899.2024.10651390
2024-01-01
Abstract: In the absence of text, training generative models directly on speech data through a next-token prediction task, as in text-based language models, has proven feasible. However, speech carries richer feature information than text. To capitalize on these additional features, we propose feature-enhanced generative spoken language modeling (fGSLM). We calculate the difference between the original speech and its normalized version, and extract quantized features from this difference with a VQ-VAE-structured model. These features are then integrated into generative spoken language modeling (GSLM) by fine-tuning the unit language model (uLM) through a multi-stream transformer. To evaluate the effectiveness of our model, we conduct experiments on the ProsAudit evaluation task in the Zero Resource Speech Challenge. Experimental results show that our model significantly improves prosody comprehension at both the sentence and lexical levels, and outperforms baseline models.
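The quantization step mentioned in the abstract can be illustrated with a minimal sketch of a nearest-neighbor codebook lookup, the core operation of a VQ-VAE bottleneck. Everything here is an assumption made for illustration: the abstract does not state the feature domain, the normalization procedure, the codebook size, or the encoder and decoder around the bottleneck, so the arrays `original` and `normalized` are random placeholders and the function `quantize` is a hypothetical name, not the authors' implementation.

```python
import numpy as np

def quantize(residual, codebook):
    """Map each residual frame to its nearest codebook vector.

    residual: array of shape (T, D), one D-dimensional vector per frame.
    codebook: array of shape (K, D), K code vectors (learned in a real VQ-VAE).
    Returns the code indices, shape (T,), and the quantized frames, shape (T, D).
    """
    # Squared Euclidean distance from every frame to every code vector: (T, K).
    dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]

# Placeholder data (assumption): 5 frames of 4-dimensional features.
rng = np.random.default_rng(0)
original = rng.normal(size=(5, 4))     # stands in for features of the original speech
normalized = rng.normal(size=(5, 4))   # stands in for features of its normalized version
residual = original - normalized       # the "difference" described in the abstract

codebook = rng.normal(size=(8, 4))     # hypothetical codebook with 8 entries
indices, quantized = quantize(residual, codebook)
```

In a setup like the one the abstract describes, the resulting index sequence would presumably serve as an additional discrete stream alongside the speech units; how the streams are combined in the multi-stream transformer is not specified here.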