A CIF-Based Speech Segmentation Method for Streaming E2E ASR
Yuchun Shu, Haoneng Luo, Shiliang Zhang, Longbiao Wang, Jianwu Dang
DOI: https://doi.org/10.1109/lsp.2023.3261662
2023-04-12
IEEE Signal Processing Letters
Abstract: Long-utterance segmentation is crucial in end-to-end (E2E) streaming automatic speech recognition (ASR). However, the commonly used voice activity detection (VAD)-based and fixed-length segmentation methods may produce overly long or semantically incomplete segments, degrading both user experience and ASR performance. In this paper, we propose a speech segmentation method for streaming E2E ASR that addresses these issues. Segment boundaries are judged using both the decoder's dependence on acoustic information and the average human breathing frequency. The frame-level dependence information is provided by the Continuous Integrate-and-Fire (CIF) predictor, which is optimized jointly with the ASR model so that the segmentation better suits ASR. Moreover, the proposed method increases neither the number of model parameters nor the real-time factor (RTF). Experimental results show that our method accurately detects pauses in speech and that the resulting segments usually contain relatively complete semantic information. Compared with VAD-based segmentation, it achieves relative reductions of 53.5% in latency and 3.7% in character error rate (CER).
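As a rough illustration of the frame-level signal the abstract refers to, the sketch below shows the generic integrate-and-fire accumulation used by CIF-style predictors: per-frame weights are summed, and a token boundary "fires" each time the running sum reaches a threshold. The second function is a purely hypothetical boundary rule (a run of consecutive low-weight frames treated as a pause). It is an assumption made for illustration and is not the decision rule of the paper, which also uses breathing-frequency information that is not modeled here. The names `cif_fire_frames`, `low_weight_boundaries`, `low`, and `min_run` are likewise invented for this example.

```python
from typing import List


def cif_fire_frames(alphas: List[float], threshold: float = 1.0) -> List[int]:
    """Return the frame indices at which the accumulated weight reaches the threshold.

    Generic integrate-and-fire accumulation; the leftover weight after a
    firing is carried over to the next token.
    """
    fired = []
    acc = 0.0
    for t, a in enumerate(alphas):
        acc += a
        if acc >= threshold:
            fired.append(t)
            acc -= threshold  # keep the remainder for the next token
    return fired


def low_weight_boundaries(alphas: List[float], low: float = 0.05, min_run: int = 3) -> List[int]:
    """Hypothetical rule (an assumption, not the paper's): mark a segment boundary
    at the frame where a run of `min_run` consecutive weights below `low` completes.
    """
    boundaries = []
    run = 0
    for t, a in enumerate(alphas):
        if a < low:
            run += 1
            if run == min_run:
                boundaries.append(t)
        else:
            run = 0
    return boundaries


# Toy weights: two tokens separated by three near-silent frames.
alphas = [0.5, 0.5, 0.0, 0.0, 0.0, 0.75, 0.25]
fires = cif_fire_frames(alphas)
pauses = low_weight_boundaries(alphas)
```

Because the weights come from a predictor that already runs as part of the ASR model, a rule of this kind needs no extra parameters, which is consistent with the abstract's claim about model size and RTF.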
engineering, electrical & electronic