Sequence Model with Self-Adaptive Sliding Window for Efficient Spoken Document Segmentation

Qinglin Zhang, Qian Chen, Yali Li, Jiaqing Liu, Wen Wang
DOI: https://doi.org/10.48550/arXiv.2107.09278
2021-10-09
Abstract: Transcripts generated by automatic speech recognition (ASR) systems for spoken documents lack structural annotations such as paragraphs, significantly reducing their readability. Automatically predicting paragraph segmentation for spoken documents may improve both readability and downstream NLP performance on tasks such as summarization and machine reading comprehension. We propose a sequence model with self-adaptive sliding window for accurate and efficient paragraph segmentation. We also propose an approach to exploit phonetic information, which significantly improves robustness of spoken document segmentation to ASR errors. Evaluations are conducted on the English Wiki-727K document segmentation benchmark, a Chinese Wikipedia-based document segmentation dataset we created, and an in-house Chinese spoken document dataset. Our proposed model outperforms the state-of-the-art (SOTA) model based on the same BERT-Base, increasing segmentation F1 on the English benchmark by 4.2 points and on Chinese datasets by 4.3-10.1 points, while reducing inference time to less than 1/6 of the inference time of the current SOTA.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the fact that transcripts of spoken documents produced by automatic speech recognition (ASR) lack structural annotations such as paragraph breaks. This significantly reduces the readability of the transcripts and may degrade downstream natural language processing (NLP) tasks such as summarization and machine reading comprehension. The paper proposes a sequence model with a self-adaptive sliding window that segments spoken documents into paragraphs accurately and efficiently, with the aim of improving both readability and the performance of subsequent NLP tasks. The main contributions of the paper are:
- A sequence model (SeqModel) that treats document segmentation as a sentence-level sequence labeling task. The model can encode longer contexts without hierarchical encoding, and the authors observe that using pre-trained models to enhance structural modeling improves segmentation accuracy.
- A self-adaptive sliding window method that further improves inference efficiency.
- A method based on phoneme embeddings that improves robustness to ASR errors and raises the F1 score of spoken document segmentation by 2.1-2.8 points.
- A systematic evaluation showing that SeqModel based on BERT-Base significantly outperforms the current state-of-the-art (SOTA) model, cross-segment BERT-Base, on the English Wiki-727K benchmark and on the newly created Chinese Wikipedia dataset (Wiki-zh), while reducing inference time to less than one-sixth.
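To make the idea of sentence-level sequence labeling with an adaptive window more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical `predict_window` callable that returns one label per sentence (1 if the sentence ends a paragraph, 0 otherwise), and it assumes one plausible adaptive rule: the next window starts right after the last boundary predicted in the current window, or after the whole window when no boundary is found. The function name, the `max_window` parameter, and the toy predictor are illustrative only.

```python
from typing import Callable, List, Sequence


def segment_with_adaptive_window(
    sentences: Sequence[str],
    predict_window: Callable[[Sequence[str]], List[int]],
    max_window: int = 4,
) -> List[int]:
    """Label each sentence with 1 (ends a paragraph) or 0.

    Assumed adaptive rule (illustrative, not taken from the paper's text):
    keep predictions up to the last predicted boundary in the window and
    restart the next window right after it, so trailing sentences are
    re-labeled with more right context.
    """
    labels = [0] * len(sentences)
    start = 0
    while start < len(sentences):
        end = min(start + max_window, len(sentences))
        window_labels = predict_window(sentences[start:end])
        boundaries = [i for i, y in enumerate(window_labels) if y == 1]
        if boundaries and end < len(sentences):
            # Advance to just past the last predicted boundary.
            keep = boundaries[-1] + 1
        else:
            # No boundary found, or final window: accept every label.
            keep = end - start
        labels[start:start + keep] = window_labels[:keep]
        start += keep  # keep >= 1, so the loop always makes progress
    return labels


def toy_predict(window: Sequence[str]) -> List[int]:
    # Hypothetical stand-in for a trained model: "!" marks a paragraph end.
    return [1 if s.endswith("!") else 0 for s in window]


if __name__ == "__main__":
    doc = ["a", "b!", "c", "d", "e!", "f"]
    print(segment_with_adaptive_window(doc, toy_predict, max_window=4))
```

In a real system, `predict_window` would wrap a pre-trained encoder such as BERT-Base that labels all sentences in the window in a single forward pass, which is where the efficiency gain over scoring each candidate boundary separately would come from.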