Temporal feature extraction based on CNN-BLSTM and temporal pooling for language identification

Xiuyan Liu, Chen Chen, Yongjun He
DOI: https://doi.org/10.1016/j.apacoust.2022.108854
IF: 3.614
2022-01-01
Applied Acoustics
Abstract: In this paper, a temporal feature extraction method based on a convolutional neural network-bidirectional long short-term memory (CNN-BLSTM) model and temporal pooling (TMPOOL) is proposed for language identification. First, the CNN-BLSTM model is employed as a front-end local feature extractor that learns temporal representations from acoustic features in both the forward and backward directions. Then the temporal pooling unit, a non-linear support vector regression (SVR) machine, efficiently learns the ordering relationship between the hidden states of the BLSTM and their time indexes. Finally, this ordering relationship is used as an utterance-level representation. We conducted experiments on three tasks of the oriental language recognition (OLR-2019) challenge. Compared with other CNN (BLSTM) methods, the proposed method achieves comparable error reductions. © 2022 Elsevier Ltd. All rights reserved.
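The temporal pooling step described in the abstract can be illustrated with a minimal sketch. The abstract specifies a non-linear SVR; the version below substitutes a plain linear ridge regression fitted by gradient descent, purely as a simplified stand-in, and it is not the authors' implementation. The function name `temporal_pool`, its hyperparameters, and the toy hidden states are all assumptions made for illustration. The idea shown is only this: fit a weight vector `w` so that the score of each hidden state approximates its time index, then use `w` itself as the utterance-level representation.

```python
def temporal_pool(hidden_states, lam=1e-3, lr=0.1, steps=5000):
    """Fit w so that dot(w, h_t) approximates the time index t.

    Simplified stand-in for the SVR-based pooling in the abstract:
    linear ridge regression trained with batch gradient descent.
    Returns w, which serves as the utterance-level vector.
    """
    num_frames = len(hidden_states)
    dim = len(hidden_states[0])
    targets = [float(t + 1) for t in range(num_frames)]  # time indexes 1..T
    w = [0.0] * dim
    for _ in range(steps):
        # Gradient of the mean squared error plus an L2 penalty on w.
        grad = [lam * wi for wi in w]
        for h, y in zip(hidden_states, targets):
            err = sum(wi * hi for wi, hi in zip(w, h)) - y
            for d in range(dim):
                grad[d] += err * h[d] / num_frames
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


# Hypothetical BLSTM hidden states for a 4-frame utterance (made-up values):
# dimension 0 grows over time, dimension 1 shrinks.
states = [[0.1, 0.9], [0.2, 0.7], [0.4, 0.4], [0.8, 0.1]]
w = temporal_pool(states)

# Scores of each frame under the learned ordering function.
scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in states]
print("utterance vector:", w)
print("frame scores:", scores)
```

In this toy case the learned scores increase with the frame index, so `w` encodes how the hidden states evolve over the utterance. A fixed-length `w` is obtained regardless of the number of frames, which is what makes it usable as an utterance-level feature for a downstream classifier.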