Disfluency annotated corpora for Indian English in technical domains

Vandan Mujadia, Pruthwik Mishra, Dipti Misra Sharma
DOI: https://doi.org/10.1007/s10579-024-09781-5
2024-10-27
Language Resources and Evaluation
Abstract: Disfluencies are common in spontaneous speech and can significantly reduce the accuracy of automated systems that process spoken input. In this work, we address this issue for Indian English by developing a human-annotated disfluency corpus, DASIE (H), comprising over 240K words from the technical lecture domain. To obtain a larger disfluency dataset, we introduce a method for generating synthetic disfluencies that uses contextual embeddings and shallow linguistic features such as part-of-speech patterns. This method allowed us to generate a synthetic disfluency corpus, DASIE (S), that exceeds 15.4 million words. We evaluate the efficacy of our disfluency-annotated corpora by developing models for disfluency identification. Our highest F1 scores are 0.93 on the Switchboard test set and 0.80 on the DASIE (H) test set with the coarser disfluency identifier. The resulting corpora and model can be used to detect and process disfluencies in various speech-interfacing applications.
computer science, interdisciplinary applications