CSLNSpeech: solving the extended speech separation problem with the help of Chinese Sign Language
Jiasong Wu, Xuan Li, Taotao Li, Fanman Meng, Youyong Kong, Guanyu Yang, Lotfi Senhadji, Huazhong Shu
DOI: https://doi.org/10.1016/j.specom.2024.103131
IF: 2.723
2024-09-05
Speech Communication
Abstract: Previous audio-visual speech separation methods exploit the synchronization between a speaker's facial movements and speech in video to self-supervise speech separation. In this paper, we propose a model for speech separation assisted by both face and sign language, a setting we call the extended speech separation problem. We design a general deep learning network that learns to combine three modalities (audio, face, and sign language) to better solve the speech separation problem. To train the model, we introduce a large-scale dataset, the Chinese Sign Language News Speech (CSLNSpeech) dataset, in which all three modalities coexist. Experimental results show that the proposed model performs better and is more robust than the usual audio-visual system. In addition, the sign language modality can be used alone to supervise speech separation, and introducing sign language helps hearing-impaired people learn and communicate. Finally, our model is a general speech separation framework and achieves very competitive separation performance on two open-source audio-visual datasets. The code is available at https://github.com/iveveive/SLNSpeech
computer science, interdisciplinary applications, acoustics