Dr.VOT : Measuring Positive and Negative Voice Onset Time in the Wild

Yosi Shrem, Matthew Goldrick, Joseph Keshet
DOI: https://doi.org/10.48550/arXiv.1910.13255
2019-10-27
Abstract: Voice Onset Time (VOT), a key measurement of speech for basic research and applied medical studies, is the time between the onset of a stop burst and the onset of voicing. When the voicing onset precedes burst onset the VOT is negative; if voicing onset follows the burst, it is positive. In this work, we present a deep-learning model for accurate and reliable measurement of VOT in naturalistic speech. The proposed system addresses two critical issues: it can measure positive and negative VOT equally well, and it is trained to be robust to variation across annotations. Our approach is based on the structured prediction framework, where the feature functions are defined to be RNNs. These learn to capture segmental variation in the signal. Results suggest that our method substantially improves over the current state-of-the-art. In contrast to previous work, our Deep and Robust VOT annotator, Dr.VOT, can successfully estimate negative VOTs while maintaining state-of-the-art performance on positive VOTs. This high level of performance generalizes to new corpora without further retraining. Index Terms: structured prediction, multi-task learning, adversarial training, recurrent neural networks, sequence segmentation.
Audio and Speech Processing, Machine Learning, Sound
What problem does this paper attempt to address?
This paper addresses the problem of accurately measuring Voice Onset Time (VOT) in naturalistic speech. In particular, it aims to measure positive and negative VOT equally well with a single model, and to generalize to new datasets.

### Problem Background

VOT is the time between the onset of a stop-consonant burst and the onset of voicing. When voicing begins before the burst, the stop is prevoiced and VOT is negative; when voicing begins after the burst, VOT is positive. VOT is a key cue for distinguishing voiced from voiceless stop consonants and is important in linguistics, clinical research, and Automatic Speech Recognition (ASR).

### Main Challenges

1. **Difficulty in Measuring Negative VOT**: Existing methods struggle to measure negative VOT accurately, especially in naturalistic speech, where the magnitude of negative VOT is small and varies greatly.
2. **Insufficient Generalization across Datasets**: Datasets differ in speaker characteristics (such as age and gender) and in recording environments, so existing models perform poorly on new datasets.

### Solutions

The paper proposes a deep-learning model named Dr.VOT to address these problems:

- **Structured Prediction Framework**: A Bidirectional Recurrent Neural Network (BiRNN) serves as the feature function and captures the dynamic changes in the speech signal.
- **Multi-Task Learning (MTL)**: VOT classification (positive/negative) is used as an auxiliary task to improve performance on the main task (VOT measurement).
- **Adversarial Training**: An adversarial branch makes the model robust across datasets and keeps it from overfitting to the characteristics of any specific dataset.
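
The sign convention and the structured-prediction idea described above can be illustrated with a toy decoder. This is a minimal sketch, not the paper's implementation: the per-frame scores are hand-written stand-ins for what the BiRNN feature functions would produce, and `FRAME_MS`, `DURATION_PENALTY`, `decode_onsets`, and `vot_ms` are hypothetical names and values chosen only for illustration.

```python
from itertools import product
from typing import Sequence, Tuple

FRAME_MS = 5.0           # assumed frame step in milliseconds (hypothetical)
DURATION_PENALTY = 0.01  # assumed per-frame penalty on long intervals (hypothetical)


def decode_onsets(
    burst_scores: Sequence[float], voicing_scores: Sequence[float]
) -> Tuple[int, int]:
    """Return the (burst_frame, voicing_frame) pair with the highest joint score.

    Structured prediction scores whole candidate outputs; here every pair of
    frames is scored and the best pair is found by brute-force search.
    """
    def joint_score(pair: Tuple[int, int]) -> float:
        burst, voicing = pair
        return (
            burst_scores[burst]
            + voicing_scores[voicing]
            - DURATION_PENALTY * abs(voicing - burst)
        )

    candidates = product(range(len(burst_scores)), range(len(voicing_scores)))
    return max(candidates, key=joint_score)


def vot_ms(burst_frame: int, voicing_frame: int) -> float:
    """VOT is voicing onset minus burst onset; negative means prevoicing."""
    return (voicing_frame - burst_frame) * FRAME_MS


# Toy scores: voicing starts after the burst, so VOT is positive.
positive = decode_onsets(
    [0.1, 0.2, 0.9, 0.3, 0.1, 0.0], [0.0, 0.1, 0.2, 0.1, 0.8, 0.3]
)
# Toy scores: voicing starts before the burst (prevoicing), so VOT is negative.
prevoiced = decode_onsets(
    [0.0, 0.1, 0.1, 0.2, 0.9, 0.1], [0.1, 0.8, 0.2, 0.1, 0.0, 0.0]
)

print(positive, vot_ms(*positive))    # (2, 4) 10.0
print(prevoiced, vot_ms(*prevoiced))  # (4, 1) -15.0
```

Because the search covers both orderings of the two onsets, the same decoder returns positive and negative VOT without a separate code path, which is the property the summary highlights.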
### Experimental Results

Experiments show that Dr.VOT performs well both on datasets seen during training and on previously unseen datasets, with the largest gains in negative VOT measurement. Specifically:

- On the unseen dataset, Dr.VOT measures negative VOT within a 2-millisecond tolerance with 32.4% accuracy, approximately 9.2% higher than the best existing method.
- Across all VOT types, Dr.VOT improves accuracy within a 10-millisecond tolerance by 2% over existing methods.

In conclusion, the paper addresses VOT measurement in naturalistic speech by combining structured prediction with multi-task learning and adversarial training, and it makes significant progress on negative VOT measurement and cross-dataset generalization.
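
The tolerance-based accuracy figures quoted above can be read as the fraction of tokens whose predicted VOT lies within a given number of milliseconds of the reference annotation. The sketch below assumes that reading, with the error taken as the absolute difference between predicted and reference VOT durations. The function name and the sample values are hypothetical and are not results from the paper.

```python
from typing import Sequence


def accuracy_within_tolerance(
    predicted_ms: Sequence[float],
    reference_ms: Sequence[float],
    tolerance_ms: float,
) -> float:
    """Fraction of predictions whose absolute error is at most tolerance_ms."""
    if len(predicted_ms) != len(reference_ms):
        raise ValueError("predicted and reference must have the same length")
    if not predicted_ms:
        raise ValueError("at least one measurement is required")
    hits = sum(
        abs(pred - ref) <= tolerance_ms
        for pred, ref in zip(predicted_ms, reference_ms)
    )
    return hits / len(predicted_ms)


# Illustrative values only (made up): a mix of positive and negative VOTs in ms.
predicted = [12.0, -85.0, 40.0, -60.0]
reference = [11.0, -90.0, 43.0, -61.5]

print(accuracy_within_tolerance(predicted, reference, 2.0))   # 0.5
print(accuracy_within_tolerance(predicted, reference, 10.0))  # 1.0
```

Reporting the same predictions at several tolerances, as the summary does with 2 ms and 10 ms, shows how much of the error is fine-grained boundary placement versus gross misses.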