Revisiting Interpolation Augmentation for Speech-to-Text Generation

Chen Xu, Jie Wang, Xiaoqian Liu, Qianqian Dong, Chunliang Zhang, Tong Xiao, Jingbo Zhu, Dapeng Man, Wu Yang
2024-06-22
Abstract: Speech-to-text (S2T) generation systems frequently face challenges in low-resource scenarios, primarily due to the lack of extensive labeled datasets. One emerging solution is constructing virtual training samples by interpolating inputs and labels, which has notably enhanced system generalization in other domains. Despite its potential, this technique's application in S2T tasks has remained under-explored. In this paper, we delve into the utility of interpolation augmentation, guided by several pivotal questions. Our findings reveal that employing an appropriate strategy in interpolation augmentation significantly enhances performance across diverse tasks, architectures, and data scales, offering a promising avenue for more robust S2T systems in resource-constrained settings.
Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the difficulty that Speech-to-Text (S2T) generation systems face in low-resource scenarios, where large amounts of annotated data are not available. To tackle this problem, the researchers study Interpolation Augmentation (IPA), a method that constructs virtual training samples by linearly interpolating input features and labels and that can significantly improve a system's generalization ability. Specifically, the paper attempts to answer the following key questions:

1. **What is the appropriate interpolation strategy?** What are the differences in handling speech features and text embeddings with interpolation augmentation?
2. **How can interpolation augmentation be effectively combined with existing augmentation techniques (such as SpecAugment)?**
3. **What specific issues exist in applying interpolation augmentation to S2T tasks, and how can these issues be resolved?**
4. **How does interpolation augmentation perform in different scenarios?**

To answer these questions, the paper conducts a series of experiments and compares two interpolation strategies: one that directly interpolates word embeddings at the decoder input layer (Embedding Interpolation, EIP), and another that interpolates at the encoder input while keeping the decoder input unchanged. The paper also explores combining interpolation augmentation with existing data augmentation techniques (such as SpecAugment) and proposes a new method called "Appending-based Interpolation Augmentation (AIPA)" to mitigate distribution shift. In addition, it introduces the concept of Constraint Objective Space (COS) to reduce the complexity of the CTC learning process. With these methods, the researchers find that interpolation augmentation can effectively improve the performance of S2T systems, especially in resource-constrained settings.
Moreover, the paper examines how interpolation augmentation behaves across different architectures (such as Encoder-Decoder and Encoder-CTC), different data scales (from 10 hours to 960 hours of LibriSpeech data), and different model backbones (such as Transformer and Conformer). Overall, the results indicate that the optimized interpolation augmentation settings are suitable not only for low-resource environments but also achieve good results in high-resource scenarios.
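
The interpolation idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`sample_mix`, `interpolate_features`, `append_virtual`, `mixed_loss`), the Beta-distributed mixing weight, the zero-padding convention, and the reading of "appending-based" as concatenating virtual samples to the original batch are all assumptions made for illustration. Because token sequences cannot be averaged directly, the sketch mixes the two per-label losses with the same weight, which is a common way to apply interpolation to sequence targets.

```python
import numpy as np


def sample_mix(batch_size, alpha, rng):
    """Draw a mixing weight and a random pairing of samples.

    Assumption: lam ~ Beta(alpha, alpha), as is common in mixup-style
    augmentation; the summary above does not state the distribution.
    """
    lam = float(rng.beta(alpha, alpha))
    perm = rng.permutation(batch_size)
    return lam, perm


def interpolate_features(feats, lengths, lam, perm):
    """Linearly interpolate padded speech features of paired samples.

    feats:   (batch, time, feat_dim) array, assumed zero-padded in time.
    lengths: (batch,) number of valid frames per sample.
    """
    mixed = lam * feats + (1.0 - lam) * feats[perm]
    # The virtual sample is valid for as long as the longer of the pair.
    mixed_lengths = np.maximum(lengths, lengths[perm])
    return mixed, mixed_lengths


def append_virtual(feats, lengths, mixed, mixed_lengths):
    """Keep the original samples and append the virtual ones.

    Assumption: this is one plausible reading of "appending-based"
    augmentation, not a confirmed description of AIPA.
    """
    all_feats = np.concatenate([feats, mixed], axis=0)
    all_lengths = np.concatenate([lengths, mixed_lengths], axis=0)
    return all_feats, all_lengths


def mixed_loss(loss_a, loss_b, lam):
    """Interpolate the losses computed against each sample's own labels."""
    return lam * loss_a + (1.0 - lam) * loss_b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 50, 80))      # 4 utterances, 80-dim features
    lengths = np.array([50, 42, 37, 20])
    lam, perm = sample_mix(len(feats), alpha=2.0, rng=rng)
    mixed, mixed_lengths = interpolate_features(feats, lengths, lam, perm)
    batch, batch_lengths = append_virtual(feats, lengths, mixed, mixed_lengths)
    print(batch.shape, batch_lengths.shape)
```

In a training loop, `loss_a` and `loss_b` would be the model's losses on the virtual sample against the labels of the original and the paired utterance, respectively; appending rather than replacing means the model still sees unmodified samples in every batch.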