Deepfake audio as a data augmentation technique for training automatic speech to text transcription models

Alexandre R. Ferreira, Cláudio E. C. Campelo
DOI: https://doi.org/10.21528/CBIC2023-169
2023-09-22
Abstract: Training transcription models that produce robust results requires a large and diverse labeled dataset. Finding data with the necessary characteristics is challenging, especially for languages less popular than English. Moreover, producing such data takes significant effort and is often costly. One strategy to mitigate this problem is data augmentation. In this work, we propose a data augmentation framework based on deepfake audio. To validate the framework, experiments were conducted using existing deepfake and transcription models. A voice cloner and a dataset of English speech recorded by Indian speakers were selected, ensuring that a single accent is present in the dataset. The augmented data was then used to train speech-to-text models in various scenarios.
Sound, Machine Learning, Audio and Speech Processing