Abstract: This article investigates a relatively underdeveloped subject in natural language processing: the generation of punctuation marks. From a theoretical perspective, we study the 16 Chinese punctuation marks defined in the Chinese national standard of punctuation usage and categorize them into three types according to their syntactic properties. We implement a three-tier maximum entropy (MaxEnt) model incorporating linguistically motivated features to generate the commonly used Chinese punctuation marks in unpunctuated sentences output by a surface realizer. Furthermore, we present a method to automatically extract cue words that indicate sentence-final punctuation marks, which serve as a specialized feature for constructing a more precise model. Evaluated on Penn Chinese Treebank data, the MaxEnt model achieves an F-score of 79.83% for punctuation insertion and 74.61% for punctuation restoration with gold-standard input, and 79.50% for insertion and 73.32% for restoration with imperfect parser-based input. The experiments show that the MaxEnt model significantly outperforms a baseline 5-gram language model, which scores 54.99% for punctuation insertion and 52.01% for restoration. We show that our results are not far from human performance on the same task, with human F-scores in the range of 81-87% for insertion and 71-82% for restoration. Finally, a manual error analysis of the generation output shows that close to 40% of the mismatched punctuation marks are in fact acceptable choices, a fact obscured by the automatic, string-matching-based evaluation scores.

Automatic punctuation generation for speech

Incorporating External POS Tagger for Punctuation Restoration

A Linguistically Inspired Statistical Model for Chinese Punctuation Generation

Unified Multimodal Punctuation Restoration Framework for Mixed-Modality Corpus

Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional Context for Continuous Speech Recognition

Streaming Punctuation for Long-form Dictation with Transformers

Improved Training for End-to-End Streaming Automatic Speech Recognition Model with Punctuation

End-to-end Joint Punctuated and Normalized ASR with a Limited Amount of Punctuated Training Data

Resolving Transcription Ambiguity in Spanish: A Hybrid Acoustic-Lexical System for Punctuation Restoration

Multimodal Punctuation Prediction with Contextual Dropout

Self-Attention Based Model For Punctuation Prediction Using Word And Speech Embeddings

LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models

Evaluating OpenAI's Whisper ASR for Punctuation Prediction and Topic Modeling of life histories of the Museum of the Person

Boosting Punctuation Restoration with Data Generation and Reinforcement Learning

Question Mark Prediction by BERT

Fast and Accurate Capitalization and Punctuation for Automatic Speech Recognition Using Transformer and Chunk Merging

Punctuation Restoration for Singaporean Spoken Languages: English, Malay, and Mandarin

Punctuation as implicit annotations for Chinese word segmentation

Automatic Speech Recognition Post-Processing for Readability: Task, Dataset and a Two-Stage Pre-Trained Approach

Sentence Punctuation for Collaborative Commentary Generation in Esports Live-Streaming

Grammar-Supervised End-to-End Speech Recognition with Part-of-Speech Tagging and Dependency Parsing