Abstract: Voice Onset Time (VOT), a key measurement of speech for basic research and applied medical studies, is the time between the onset of a stop burst and the onset of voicing. When the voicing onset precedes burst onset the VOT is negative; if voicing onset follows the burst, it is positive. In this work, we present a deep-learning model for accurate and reliable measurement of VOT in naturalistic speech. The proposed system addresses two critical issues: it can measure positive and negative VOT equally well, and it is trained to be robust to variation across annotations. Our approach is based on the structured prediction framework, where the feature functions are defined to be RNNs. These learn to capture segmental variation in the signal. Results suggest that our method substantially improves over the current state-of-the-art. In contrast to previous work, our Deep and Robust VOT annotator, Dr.VOT, can successfully estimate negative VOTs while maintaining state-of-the-art performance on positive VOTs. This high level of performance generalizes to new corpora without further retraining. Index Terms: structured prediction, multi-task learning, adversarial training, recurrent neural networks, sequence segmentation.

What problem does this paper attempt to address?

The paper aims to measure Voice Onset Time (VOT) accurately in naturalistic speech, in particular to measure positive and negative VOT equally well, and to improve how well the model generalizes across datasets.

### Problem Background

VOT is the time difference between the onset of a stop-consonant burst and the onset of voicing. When voicing begins before the burst, the stop is prevoiced and VOT is negative; when voicing begins after the burst, VOT is positive. VOT is a key feature for distinguishing voiced from voiceless consonants and is important in linguistics, clinical research, and Automatic Speech Recognition (ASR).

### Main Challenges

1. **Difficulty in Measuring Negative VOT**: Existing methods struggle to measure negative VOT accurately, especially in naturalistic speech, where the magnitude of negative VOT is small and varies greatly.
2. **Insufficient Generalization across Datasets**: Differences in speaker characteristics (such as age and gender) and in recording environments lead existing models to perform poorly on new datasets.

### Solutions

The paper proposes a deep-learning model named Dr.VOT that targets these problems:

- **Structured Prediction Framework**: A Bidirectional Recurrent Neural Network (BiRNN) serves as the feature function and captures the dynamic changes in the speech signal.
- **Multi-Task Learning (MTL)**: VOT classification (positive/negative) is used as an auxiliary task to improve performance on the main task (VOT measurement).
- **Adversarial Training**: An adversarial branch makes the model robust across datasets and discourages overfitting to the characteristics of any one dataset.
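
The definition of signed VOT and the positive/negative label used by the auxiliary classification task can be illustrated with a minimal sketch. It is not taken from the paper: the `StopAnnotation` container, the function names, the example onset times, and the choice to label a VOT of exactly zero as positive are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StopAnnotation:
    """Hypothetical container for one annotated stop (times in seconds)."""
    burst_onset: float
    voicing_onset: float


def vot_ms(ann: StopAnnotation) -> float:
    """Signed VOT in milliseconds: voicing onset minus burst onset."""
    return (ann.voicing_onset - ann.burst_onset) * 1000.0


def vot_sign_label(ann: StopAnnotation) -> str:
    """Positive/negative label of the kind an auxiliary classifier could predict.

    Assumption: a VOT of exactly zero is labeled positive.
    """
    return "negative" if vot_ms(ann) < 0 else "positive"


# Invented example values: voicing starts before the burst (prevoiced)
# in the first case and after the burst in the second.
prevoiced = StopAnnotation(burst_onset=0.250, voicing_onset=0.170)
lagged = StopAnnotation(burst_onset=0.250, voicing_onset=0.310)

for ann in (prevoiced, lagged):
    print(round(vot_ms(ann)), vot_sign_label(ann))
```

In a multi-task setup of the kind summarized above, a label like this would be the target of the auxiliary branch, while the two onset times would be the targets of the main measurement task.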
### Experimental Results

Experiments show that Dr.VOT performs well on both seen and unseen datasets, with the largest gains in negative VOT measurement. Specifically:

- On the unseen dataset, Dr.VOT's accuracy on negative VOT within a 2-millisecond tolerance reaches 32.4%, approximately 9.2% higher than the best existing method.
- Across all VOT types, Dr.VOT improves accuracy within a 10-millisecond tolerance by 2% compared to existing methods.

In conclusion, the paper addresses VOT measurement in naturalistic speech by combining multi-task learning and adversarial training, with notable progress in negative VOT measurement and cross-dataset generalization.
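
The tolerance-based accuracy figures above can be read as the share of predictions that fall within a fixed number of milliseconds of the manual annotation. The sketch below assumes that reading; the function name and the sample values are invented for illustration and are not results from the paper.

```python
def accuracy_within_tolerance(predicted_ms, reference_ms, tolerance_ms):
    """Fraction of predictions whose absolute error is at most tolerance_ms."""
    if len(predicted_ms) != len(reference_ms):
        raise ValueError("predicted and reference lists must have equal length")
    if not predicted_ms:
        raise ValueError("at least one prediction is required")
    hits = sum(
        1
        for pred, ref in zip(predicted_ms, reference_ms)
        if abs(pred - ref) <= tolerance_ms
    )
    return hits / len(predicted_ms)


# Invented signed VOT values in milliseconds (negative = prevoiced).
predicted = [-78.0, -90.0, 61.0, 24.0]
reference = [-80.0, -83.0, 60.0, 35.0]

acc_2ms = accuracy_within_tolerance(predicted, reference, tolerance_ms=2.0)
acc_10ms = accuracy_within_tolerance(predicted, reference, tolerance_ms=10.0)
print(f"within 2 ms: {acc_2ms:.2f}, within 10 ms: {acc_10ms:.2f}")
```

Reporting several tolerances side by side, as the summary does with 2 ms and 10 ms, separates near-exact boundary placement from coarser agreement with the annotator.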

Dr.VOT : Measuring Positive and Negative Voice Onset Time in the Wild

A Novel and Efficient Voice Activity Detector Using Shape Features of Speech Wave.

VOAT: Voice Onset Analysis Tool

Semantic VAD: Low-Latency Voice Activity Detection for Speech Interaction

Longitudinal Speech Biomarkers for Automated Alzheimer's Detection

DigiVoice: Voice Biomarker Featurization and Analysis Pipeline

Learnable Spectro-temporal Receptive Fields for Robust Voice Type Discrimination

Voice Activity Detection Based on Time-Delay Neural Networks

ALO-VC: Any-to-any Low-latency One-shot Voice Conversion

FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel Filter

Voice activity detection in the wild: A data-driven approach using teacher-student training

FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion

Digital Voice-Based Biomarker for Monitoring Respiratory Quality of Life: Findings from the Colive Voice Study

Pre-Trained Foundation Model representations to uncover Breathing patterns in Speech

Towards Naturalistic Voice Conversion: NaturalVoices Dataset with an Automatic Processing Pipeline

VOT: Revolutionizing Speaker Verification with Memory and Attention Mechanisms

Predictions of Subjective Ratings and Spoofing Assessments of Voice Conversion Challenge 2020 Submissions.

Voice Disorder Detection Using Long Short Term Memory (LSTM) Model

Adversarial Post-Processing of Voice Conversion Against Spoofing Detection

Personal VAD: Speaker-Conditioned Voice Activity Detection

Voice Disorder Analysis: a Transformer-based Approach