Multi-Modal Multi-Scale Speech Expression Evaluation In Computer-Assisted Language Learning

Jingbei Li, Zhiyong Wu, Runnan Li, Mingxing Xu, Kehua Lei, Lianhong Cai
DOI: https://doi.org/10.1007/978-3-319-94361-9_2
2018-01-01
Abstract: Computer-assisted language learning (CALL) has attracted increasing interest in language teaching and learning. In a computer-supported learning environment, both pronunciation correction and expression modulation are recognized as essential for contemporary learners. However, while mispronunciation detection and diagnosis (MDD) technologies have achieved significant success, speech expression evaluation still relies on expensive, resource-consuming manual assessment. In this paper, we propose a novel multi-modal, multi-scale neural-network-based approach for automatic speech expression evaluation in CALL. In particular, a multi-modal sparse auto-encoder (MSAE) is first employed to make full use of both lexical and acoustic features; a recurrent auto-encoder (RAE) is then employed to produce features at different time scales; and an attention-based multi-scale bidirectional long short-term memory (BLSTM) model is finally employed to score the speech expression. Experimental results on data collected from a realistic airline broadcast evaluation demonstrate the effectiveness of the proposed approach, which achieves human-level predictive ability with an acceptable rate of 70.4%.
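The abstract gives no implementation details, so the following is only a rough sketch of the general idea of summarizing a feature sequence at several time scales with attention. It is not the authors' model: non-overlapping mean pooling stands in for the learned RAE, plain dot-product attention stands in for the attention-based BLSTM scorer, and every name here (`mean_pool`, `attention_pool`, `multi_scale_summary`, `frames`, `query`) is a hypothetical choice made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(frames, window):
    """Average consecutive frames in non-overlapping windows.

    Assumption: this is a simple stand-in for a learned encoder such as
    the RAE, used only to obtain a coarser time scale.
    """
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled

def attention_pool(states, query):
    """Dot-product attention: weight each state by softmax(state . query)."""
    scores = [sum(s_d * q_d for s_d, q_d in zip(s, query)) for s in states]
    weights = softmax(scores)
    dim = len(states[0])
    summary = [sum(w * s[d] for w, s in zip(weights, states)) for d in range(dim)]
    return summary, weights

def multi_scale_summary(frames, windows, query):
    """Concatenate one attention summary per time scale."""
    summary = []
    for window in windows:
        pooled, _ = attention_pool(mean_pool(frames, window), query)
        summary.extend(pooled)
    return summary

# Toy input: four 2-dimensional frames (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
query = [0.5, -0.5]  # hypothetical attention query; learned in a real model
features = multi_scale_summary(frames, windows=[1, 2], query=query)
```

In this sketch `features` holds one summary vector per scale, concatenated, which a downstream regressor or classifier could map to an expression score. In a trained system the query and the per-scale encoders would be learned jointly rather than fixed by hand.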