Boosting Punctuation Restoration with Data Generation and Reinforcement Learning

Viet Dac Lai, Abel Salinas, Hao Tan, Trung Bui, Quan Tran, Seunghyun Yoon, Hanieh Deilamsalehy, Franck Dernoncourt, Thien Huu Nguyen
2023-07-25
Abstract: Punctuation restoration is an important task in automatic speech recognition (ASR) which aims to restore the syntactic structure of generated ASR texts to improve readability. While punctuated texts are abundant from written documents, the discrepancy between written punctuated texts and ASR texts limits the usability of written texts in training punctuation restoration systems for ASR texts. This paper proposes a reinforcement learning method to exploit in-topic written texts and recent advances in large pre-trained generative language models to bridge this gap. The experiments show that our method achieves state-of-the-art performance on the ASR test set on two benchmark datasets for punctuation restoration.
Computation and Language
What problem does this paper attempt to address?
This paper addresses punctuation restoration in text generated by Automatic Speech Recognition (ASR). Specifically, it focuses on using topic-related written texts and recent large pre-trained generative language models, within a reinforcement learning method, to improve the performance of punctuation restoration systems. The motivation is that training directly on existing written text is not ideal, because written texts differ significantly from ASR output: ASR output contains a large amount of noise, such as spoken pauses and word errors introduced during transcription. The paper therefore proposes a new data generation method, combined with a reinforcement learning framework, to produce large-scale, high-quality labeled data and thereby improve punctuation restoration on real ASR texts.
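To make the task concrete, the sketch below shows one common way to frame punctuation restoration as per-word labeling: punctuated written text is stripped of punctuation and casing, and each word keeps a label for the mark that followed it. This is an illustration only, not the paper's data generation or reinforcement learning method. The function name `to_labeled_tokens` and the label set are assumptions made for the example, and the sketch does not model ASR noise such as pauses or word errors.

```python
import re

# Hypothetical label set, chosen for illustration; the summary above does not
# state which punctuation marks the paper predicts.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}


def to_labeled_tokens(text):
    """Turn punctuated written text into (word, label) pairs.

    Each word is lowercased and stripped of punctuation, which roughly mimics
    unpunctuated ASR output. The label records the mark that followed the word
    in the written text, or "O" when there was none.
    """
    pairs = []
    for raw in text.split():
        # Keep letters, digits, underscores, and apostrophes; drop the rest.
        word = re.sub(r"[^\w']", "", raw).lower()
        if not word:
            continue  # the token was punctuation only
        label = PUNCT_LABELS.get(raw[-1], "O")
        pairs.append((word, label))
    return pairs


if __name__ == "__main__":
    for word, label in to_labeled_tokens("Hello, how are you? Fine."):
        print(word, label)
```

Pairs built this way from clean written text lack the pauses and transcription errors of real ASR output, which is the mismatch the paper sets out to bridge.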