Multi-Task Learning for Front-End Text Processing in TTS

Wonjune Kang, Yun Wang, Shun Zhang, Arthur Hinsvark, Qing He
DOI: https://doi.org/10.1109/ICASSP48485.2024.10446241
2024-01-12
Abstract: We propose a multi-task learning (MTL) model for jointly performing three tasks that are commonly solved in a text-to-speech (TTS) front-end: text normalization (TN), part-of-speech (POS) tagging, and homograph disambiguation (HD). Our framework utilizes a tree-like structure with a trunk that learns shared representations, followed by separate task-specific heads. We further incorporate a pre-trained language model to utilize its built-in lexical and contextual knowledge, and study how to best use its embeddings so as to most effectively benefit our multi-task model. Through task-wise ablations, we show that our full model trained on all three tasks achieves the strongest overall performance compared to models trained on individual or sub-combinations of tasks, confirming the advantages of our MTL framework. Finally, we introduce a new HD dataset containing a balanced number of sentences in diverse contexts for a variety of homographs and their pronunciations. We demonstrate that incorporating this dataset into training significantly improves HD performance over only using a commonly used, but imbalanced, pre-existing dataset.
Computation and Language
What problem does this paper attempt to address?
This paper addresses how to perform three common text-to-speech (TTS) front-end tasks jointly with a single multi-task learning (MTL) model: text normalization (TN), part-of-speech (POS) tagging, and homograph disambiguation (HD). These tasks respectively convert non-standard written forms into their spoken forms, tag the part of speech of each word in the text, and determine the correct pronunciation of homographs (words that share a spelling but have different pronunciations). Although most TTS pipelines train and use these components separately, the authors argue that the tasks can benefit from one another through shared representations, since all three operate on the same input text. The paper therefore proposes a tree-structured MTL model with a shared trunk that extracts general features, followed by task-specific heads, with the aim of exploiting commonalities among the tasks and improving overall performance. The paper also introduces a new HD dataset containing a balanced number of sentences for a variety of homographs and their pronunciations, which improves HD performance when added to training.
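The trunk-and-heads layout described above can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the single linear trunk, the framing of every task as per-token classification, the class name MultiTaskFrontEnd, and all dimensions and label counts are assumptions chosen only to show how one shared representation can feed several task-specific heads. The paper's actual model also uses embeddings from a pre-trained language model, which is omitted here.

```python
import random


def make_linear(n_in, n_out, rng):
    """Return a weight matrix (n_out rows of n_in floats) with small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(weights, vec):
    """Multiply a weight matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def relu(vec):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in vec]


class MultiTaskFrontEnd:
    """Shared trunk followed by one linear head per task (illustrative only)."""

    def __init__(self, embed_dim, hidden_dim, num_labels, seed=0):
        rng = random.Random(seed)
        # Shared trunk: its weights are used by every task.
        self.trunk = make_linear(embed_dim, hidden_dim, rng)
        # One head per task; the label counts are hypothetical.
        self.heads = {
            task: make_linear(hidden_dim, n, rng) for task, n in num_labels.items()
        }

    def forward(self, token_embeddings):
        # The trunk runs once per token; every head reads the same shared features.
        shared = [relu(apply_linear(self.trunk, e)) for e in token_embeddings]
        return {
            task: [apply_linear(head, h) for h in shared]
            for task, head in self.heads.items()
        }


# Hypothetical sizes: 8-dim token embeddings, 16-dim shared features.
model = MultiTaskFrontEnd(
    embed_dim=8, hidden_dim=16, num_labels={"TN": 5, "POS": 12, "HD": 3}
)
rng = random.Random(1)
sentence = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]  # 4 tokens
logits = model.forward(sentence)  # dict: task name -> per-token logit vectors
```

In a trained version of such a model, the per-task losses would typically be combined so that gradients from all three tasks update the shared trunk, which is how the tasks could inform one another; the weighting of those losses is left unspecified here because the summary does not state it.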