Abstract: Speech-to-text (S2T) generation systems frequently face challenges in low-resource scenarios, primarily due to the lack of extensive labeled datasets. One emerging solution is constructing virtual training samples by interpolating inputs and labels, which has notably enhanced system generalization in other domains. Despite its potential, this technique's application in S2T tasks has remained under-explored. In this paper, we delve into the utility of interpolation augmentation, guided by several pivotal questions. Our findings reveal that employing an appropriate strategy in interpolation augmentation significantly enhances performance across diverse tasks, architectures, and data scales, offering a promising avenue for more robust S2T systems in resource-constrained settings.

What problem does this paper attempt to address?

The paper addresses the difficulties that speech-to-text (S2T) generation systems face in low-resource scenarios, where large amounts of annotated data are not available. To tackle this problem, the researchers study interpolation augmentation (IPA), a method that constructs virtual training samples by linearly interpolating input features and labels and that has improved generalization in other domains. Specifically, the paper attempts to answer the following key questions:

1. **What is the appropriate interpolation strategy?** How does interpolation augmentation differ when applied to speech features and to text embeddings?
2. **How can interpolation augmentation be effectively combined with existing augmentation techniques (such as SpecAugment)?**
3. **What specific issues arise when applying interpolation augmentation to S2T tasks, and how can they be resolved?**
4. **How does interpolation augmentation perform in different scenarios?**

To answer these questions, the paper conducts a series of experiments and proposes two interpolation strategies: one that directly interpolates word embeddings at the decoder input layer (Embedding Interpolation, EIP), and another that interpolates at the encoder input while keeping the decoder input unchanged. The paper also explores combining interpolation augmentation with existing data augmentation techniques (such as SpecAugment) and proposes a new method, "Appending-based Interpolation Augmentation (AIPA)", to mitigate distribution shift. In addition, it introduces the concept of a Constraint Objective Space (COS) to reduce the complexity of the CTC learning process. With these methods, the researchers find that interpolation augmentation effectively improves the performance of S2T systems, especially in resource-constrained settings.
Moreover, the paper evaluates interpolation augmentation across different architectures (such as Encoder-Decoder and Encoder-CTC), different data scales (LibriSpeech settings ranging from 10 hours to 960 hours), and different model backends (such as Transformer and Conformer models). Overall, the results indicate that the optimized interpolation augmentation settings are suitable not only for low-resource environments but also for high-resource scenarios.
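The core operation summarized above, linearly interpolating input features and labels, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names `interpolate_batch` and `interpolate_loss`, the Beta-distributed mixing weight, the pairing by random permutation, and the choice to realize label interpolation as a weighted sum of two per-target losses are all assumptions borrowed from the general mixup recipe. The sketch does not cover EIP, AIPA, or COS.

```python
import numpy as np


def interpolate_batch(features, lengths, alpha=2.0, rng=None):
    """Mix each utterance with a randomly chosen partner from the same batch.

    features: zero-padded array of shape (batch, time, feat_dim)
    lengths:  integer array of shape (batch,) with the true frame counts
    alpha:    Beta(alpha, alpha) parameter for the mixing weight (assumed)
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = float(rng.beta(alpha, alpha))      # mixing weight in [0, 1]
    perm = rng.permutation(len(features))    # partner index for each sample
    mixed = lam * features + (1.0 - lam) * features[perm]
    # The mixed utterance spans the longer of the two inputs.
    mixed_lengths = np.maximum(lengths, lengths[perm])
    return mixed, mixed_lengths, perm, lam


def interpolate_loss(loss_a, loss_b, lam):
    """Label interpolation expressed as a weighted sum of two losses.

    loss_a is computed against the original targets, loss_b against the
    partner's targets; both would come from the S2T model (not shown here).
    """
    return lam * loss_a + (1.0 - lam) * loss_b


# Toy usage with random "filterbank" features (hypothetical shapes).
demo_rng = np.random.default_rng(0)
demo_feats = demo_rng.normal(size=(4, 6, 3))
demo_lengths = np.array([6, 4, 5, 3])
demo_mixed, demo_mixed_lengths, demo_perm, demo_lam = interpolate_batch(
    demo_feats, demo_lengths, rng=demo_rng
)
```

In a training loop under these assumptions, the model would be run once on `demo_mixed`, the loss would be computed against both the original transcripts and the transcripts indexed by `demo_perm`, and the two values would be combined with `interpolate_loss`.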

Revisiting Interpolation Augmentation for Speech-to-Text Generation

ViSPer: A Multilingual TTS Approach Based on VITS Using Deep Feature Loss

Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Data Augmentation for End-to-end Code-switching Speech Recognition

Text Generation with Speech Synthesis for ASR Data Augmentation

Data Augmentation for Code-Switch Language Modeling by Fusing Multiple Text Generation Methods.

Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation

Speech Recognition with Augmented Synthesized Speech

Improving Low Resource Code-switched ASR using Augmented Code-switched TTS

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

Speech Synthesis as Augmentation for Low-Resource ASR

Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation

A Comprehensive Investigation on Speaker Augmentation for Speaker Recognition

Improving Speech-to-Speech Translation Through Unlabeled Text

Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation

DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification

When Is TTS Augmentation Through a Pivot Language Useful?

Code-Switching Text Generation and Injection in Mandarin-English ASR

Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

Learning from Multiple Noisy Augmented Data Sets for Better Cross-Lingual Spoken Language Understanding