Abstract: Prosody and prosodic boundaries are important aspects of speech and carry significant linguistic and paralinguistic information. In prosodic event detection, many local acoustic features have been investigated, but contextual information has not yet been thoroughly exploited. The main difficulty lies in learning long-distance contextual dependencies effectively and efficiently. To address this problem, we introduce an algorithm called auto-context. In this algorithm, a classifier is first trained on a set of local acoustic features; the probabilities it generates are then used, together with the local features, as contextual information to train new classifiers. By iteratively feeding the updated probabilities back in as contextual information, the algorithm can accurately model contextual dependencies and improve classification performance. The advantages of this method include its flexible structure and its ability to capture contextual relationships. Using auto-context with support vector machines, in combination with the acoustic context, we improve detection accuracy by about 3% and F-score by more than 7% on both two-way and four-way pitch accent detection. For boundary detection, the accuracy improvement is about 1% and the F-score improvement reaches 12%. The new algorithm outperforms conditional random fields, especially on boundary detection in terms of F-score. It also outperforms an n-gram language model on the task of pitch accent detection.
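
The iterative scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: a small logistic regression stands in for the support vector machine, the data are a single synthetic sequence, and the names `train_logreg`, `add_context`, `auto_context`, the context offsets, and the padding value 0.5 are all assumptions introduced here.

```python
import math
import random


def predict_proba(w, b, x):
    """Probability of the positive class under a logistic model."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, epochs=200, lr=0.5):
    """Binary logistic regression by batch gradient descent.

    Assumption: used here only as a lightweight stand-in for the SVM.
    """
    n_feat = len(X[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * n_feat, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(n_feat):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b


def add_context(local, probs, offsets=(-2, -1, 1, 2), pad=0.5):
    """Append neighbours' current probabilities to each local feature vector.

    Positions outside the sequence get the neutral value `pad` (an assumption).
    """
    n = len(local)
    out = []
    for t in range(n):
        ctx = [probs[t + k] if 0 <= t + k < n else pad for k in offsets]
        out.append(list(local[t]) + ctx)
    return out


def auto_context(local, labels, n_iter=3):
    """Train a chain of classifiers, each fed the previous one's probabilities."""
    models = [train_logreg(local, labels)]  # step 0: local features only
    probs = [predict_proba(*models[0], x) for x in local]
    for _ in range(n_iter):
        X = add_context(local, probs)  # local features + contextual probabilities
        model = train_logreg(X, labels)
        models.append(model)
        probs = [predict_proba(*model, x) for x in X]  # updated probabilities
    return models, probs


# Toy usage: labels come in runs of three, so neighbours are informative.
random.seed(0)
labels = [1 if (t // 3) % 2 == 0 else 0 for t in range(60)]
local = [[lab + random.gauss(0.0, 1.0)] for lab in labels]
models, probs = auto_context(local, labels, n_iter=2)
accuracy = sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels)) / len(labels)
print(f"training accuracy after auto-context: {accuracy:.2f}")
```

At prediction time the same chain would be applied in order: the first model produces probabilities from local features, and each later model refines them using the context built from the previous step. The choice of context offsets controls how far the contextual dependencies can reach per iteration.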

Exploiting Contextual Information for Prosodic Event Detection Using Auto-Context
