Sequence Model with Self-Adaptive Sliding Window for Efficient Spoken Document Segmentation

Qinglin Zhang, Qian Chen, Yali Li, Jiaqing Liu, Wen Wang

DOI: https://doi.org/10.48550/arXiv.2107.09278

2021-10-09

Abstract: Transcripts generated by automatic speech recognition (ASR) systems for spoken documents lack structural annotations such as paragraphs, significantly reducing their readability. Automatically predicting paragraph segmentation for spoken documents may improve both readability and the performance of downstream NLP tasks such as summarization and machine reading comprehension. We propose a sequence model with a self-adaptive sliding window for accurate and efficient paragraph segmentation. We also propose an approach to exploit phonetic information, which significantly improves the robustness of spoken document segmentation to ASR errors. Evaluations are conducted on the English Wiki-727K document segmentation benchmark, a Chinese Wikipedia-based document segmentation dataset we created, and an in-house Chinese spoken document dataset. Our proposed model outperforms the state-of-the-art (SOTA) model based on the same BERT-Base, increasing segmentation F1 on the English benchmark by 4.2 points and on the Chinese datasets by 4.3-10.1 points, while reducing inference time to less than 1/6 of that of the current SOTA.

Computation and Language

What problem does this paper attempt to address?

The paper addresses the fact that transcripts of spoken documents produced by automatic speech recognition (ASR) lack structural annotations such as paragraph breaks. This significantly reduces the readability of the transcripts and may degrade downstream natural language processing (NLP) tasks such as summarization and machine reading comprehension. To address this, the paper proposes a sequence model with a self-adaptive sliding window that segments spoken documents into paragraphs accurately and efficiently, with the goal of improving both readability and downstream NLP performance.

The main contributions of the paper are:

- A sequence model (SeqModel) that treats document segmentation as a sentence-level sequence labeling task. The model encodes longer contexts without hierarchical encoding, and the authors observe that using pre-trained models to enhance structural modeling improves segmentation accuracy.
- A self-adaptive sliding window method that further improves inference efficiency.
- A phoneme-embedding-based method that improves robustness to ASR errors and raises the F1 score of spoken document segmentation by 2.1-2.8 points.
- A systematic evaluation showing that SeqModel based on BERT-Base significantly outperforms the current state-of-the-art (SOTA) model, cross-segment BERT-Base, on the English Wiki-727K benchmark and on the Chinese Wikipedia dataset the authors created (Wiki-zh), while reducing inference time to less than one-sixth of that model's.
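To make the sliding window idea more concrete, here is a minimal Python sketch of sentence-level sequence labeling with a window whose next start position depends on the model's own predictions. This summary does not spell out the paper's exact rule, so the sketch assumes one plausible reading: the next window begins right after the last paragraph boundary predicted in the current window, and falls back to the window end when no boundary is predicted. The names `segment_with_adaptive_window`, `predict_window`, `toy_predict`, and `window_size` are hypothetical, and the toy predictor stands in for a BERT-based labeler.

```python
from typing import Callable, List, Sequence


def segment_with_adaptive_window(
    sentences: Sequence[str],
    predict_window: Callable[[Sequence[str]], List[int]],
    window_size: int = 4,
) -> List[int]:
    """Return one label per sentence: 1 if it ends a paragraph, else 0.

    Assumption (not taken from the paper): the next window starts right
    after the last boundary predicted in the current window.
    """
    labels = [0] * len(sentences)
    start = 0
    while start < len(sentences):
        end = min(start + window_size, len(sentences))
        # Label all sentences of the current window in one pass.
        window_labels = predict_window(sentences[start:end])
        labels[start:end] = window_labels
        if end == len(sentences):
            break
        boundaries = [i for i, y in enumerate(window_labels) if y == 1]
        if boundaries:
            # Restart just after the last predicted paragraph end, so the
            # following sentences are re-labeled with fresh context.
            start = start + boundaries[-1] + 1
        else:
            # No boundary found: move on to the next disjoint window.
            start = end
    return labels


def toy_predict(window: Sequence[str]) -> List[int]:
    """Hypothetical stand-in for a trained model: '!' marks a paragraph end."""
    return [1 if s.endswith("!") else 0 for s in window]


labels = segment_with_adaptive_window(
    ["a", "b!", "c", "d", "e!", "f"], toy_predict, window_size=4
)
print(labels)  # [0, 1, 0, 0, 1, 0]
```

In this toy run the first window covers sentences 0-3 and predicts a boundary at sentence 1, so the second window starts at sentence 2 instead of sentence 4. Whether the actual model bounds the step size or handles overlaps differently is not stated here, so treat the stepping rule as illustrative only.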

