Investigation of Various Hybrid Acoustic Modeling Units Via a Multitask Learning and Deep Neural Network Technique for LVCSR of the Low-Resource Language, Amharic

Tessfu Geteye Fantaye, Junqing Yu, Tulu Tilahun Hailu
DOI: https://doi.org/10.1109/access.2019.2931391
IF: 3.9
2019-01-01
IEEE Access
Abstract: Multitask learning (MTL) helps improve the performance of related tasks when the training dataset is limited and sparse, especially for low-resource languages. Amharic is a low-resource language and suffers from training data scarcity, sparsity, and unevenness. Consequently, speech recognizers based on fundamental acoustic units perform worse than the speech recognizers of technologically favored languages. This paper presents our results on the use of various hybrid acoustic modeling units for the Amharic language. Deep neural network (DNN) models based on the fundamental acoustic units, namely syllable, phone, and rounded phone units, have been developed. Various hybrid acoustic units have been investigated by jointly training the fundamental acoustic units via the MTL technique. The hybrid units and the fundamental units are discussed and compared. The experimental results demonstrate that all the DNN models based on fundamental units outperform the Gaussian mixture models (GMM), with relative performance improvements of 14.14% to 23.31%. All the hybrid units outperform the fundamental acoustic units, with relative performance improvements of 1.33% to 4.27%. The syllable and phone units exhibit higher performance under sufficient and limited training datasets, respectively. All the hybrid units are useful with both sufficient and limited training datasets and outperform the fundamental units. Overall, our results show that DNN is an effective acoustic modeling technique for the Amharic language. The context-dependent (CD) syllable is the more suitable unit if a sufficient training corpus is available and the accuracy of the recognizer is prioritized. The CD phone is a superior unit when the available training dataset is limited, realizing the highest accuracy and a fast recognition speed. The hybrid acoustic units perform best under both sufficient and limited training datasets and achieve the highest accuracy.
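
The joint training of acoustic units via MTL mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the network architecture, so everything below is an assumption made for illustration: one shared hidden layer, one softmax output per acoustic unit type, and a weighted sum of per-task cross-entropy losses. The layer sizes, the task weights, and the function names (mtl_forward, joint_loss) are hypothetical and are not taken from the paper.

```python
import math
import random

def softmax(z):
    # Numerically stable softmax over a list of scores.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def affine(x, W, b):
    # Computes W @ x + b, with W stored as a list of rows.
    return [sum(w * v for w, v in zip(row, x)) + bj for row, bj in zip(W, b)]

def init_layer(n_out, n_in, rng):
    # Small random weights and zero biases (toy initialization).
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def mtl_forward(x, shared, heads):
    # Shared hidden layer (ReLU) feeds one softmax head per task.
    W, b = shared
    h = [max(0.0, v) for v in affine(x, W, b)]
    return {task: softmax(affine(h, Wt, bt)) for task, (Wt, bt) in heads.items()}

def joint_loss(posteriors, targets, weights):
    # Weighted sum of per-task cross-entropy losses for one frame.
    return sum(weights[t] * -math.log(posteriors[t][targets[t]])
               for t in posteriors)

rng = random.Random(0)
shared = init_layer(8, 4, rng)                # hypothetical: 4 features, 8 hidden units
heads = {
    "syllable": init_layer(5, 8, rng),        # hypothetical: 5 syllable classes
    "phone": init_layer(3, 8, rng),           # hypothetical: 3 phone classes
}
x = [0.2, -0.1, 0.4, 0.0]                     # one toy acoustic feature frame
post = mtl_forward(x, shared, heads)
loss = joint_loss(post,
                  {"syllable": 2, "phone": 1},        # toy frame labels
                  {"syllable": 1.0, "phone": 0.5})    # hypothetical task weights
print(loss)
```

In a setup like this, the gradient of the joint loss reaches the shared layer from every task, which is the usual intuition for why related tasks can help each other when training data is scarce.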