Abstract: Deep learning has brought significant improvements to the field of cross-modal representation learning. Tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR) call for a cross-modal, fine-grained (frame-level) sequence representation that emphasizes the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality. We propose a method called "Vector Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP)", which uses a cross-modal aligned sequence transcoder to bring text and speech into a joint multimodal space, learning how to connect text and speech at the frame level. VQ-CTAP is a paradigm for cross-modal sequence representation learning and offers a promising solution for fine-grained generation and recognition tasks in speech processing. It can be applied directly to VC and ASR tasks without fine-tuning or additional structures. For the TTS task, we propose a sequence-aware semantic connector that links multiple frozen pre-trained modules and gives the system a plug-and-play capability. We design a stepping optimization strategy that ensures effective model convergence by gradually injecting the various loss components and adjusting their influence. Furthermore, we propose a semantic-transfer-wise paralinguistic consistency loss to enhance representational capabilities, allowing the model to generalize better to unseen data and to capture the nuances of paralinguistic information. In addition, VQ-CTAP achieves high-compression speech coding at a rate of 25 Hz from 24 kHz input waveforms, a 960-fold reduction in the sampling rate. The audio demo is available at https://qiangchunyu.github.io/VQCTAP/
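The compression figure quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `frames_and_ratio` and the 10-second example duration are assumptions made here, not part of VQ-CTAP, and only the 24 kHz input rate and 25 Hz token rate come from the abstract.

```python
def frames_and_ratio(duration_s: float,
                     sample_rate_hz: int = 24_000,
                     token_rate_hz: int = 25):
    """Return (waveform samples, coded frames, samples per frame).

    Hypothetical helper for illustration only; the two default rates
    are the ones stated in the abstract.
    """
    samples = int(duration_s * sample_rate_hz)   # raw waveform length
    tokens = int(duration_s * token_rate_hz)     # frames after coding
    hop = sample_rate_hz // token_rate_hz        # samples covered by one frame
    return samples, tokens, hop


samples, tokens, hop = frames_and_ratio(10.0)
print(samples, tokens, hop)  # 240000 250 960
```

Under these assumptions, a 10-second clip of 240,000 samples maps to 250 frames, so each frame spans 960 waveform samples, which matches the 960-fold reduction stated above.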

VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing
