A multi-purpose audio-visual corpus for multi-modal Persian speech recognition: The Arman-AV dataset
Javad Peymanfard, Samin Heydarian, Ali Lashini, Hossein Zeinali, Mohammad Reza Mohammadi, Nasser Mozayani
DOI: https://doi.org/10.1016/j.eswa.2023.121648
IF: 8.5
2023-09-01
Expert Systems with Applications
Abstract: Automatic lip reading has advanced significantly in recent years. However, lip reading methods need large-scale datasets, which are scarce for many low-resource languages. In this paper, we introduce a new multipurpose audio-visual dataset for Persian. The dataset contains approximately 220 h of video from 1760 speakers and can be used for multiple tasks, such as lip reading, automatic speech recognition, audio-visual speech recognition, and speaker recognition. It is also the first large-scale lip reading dataset for Persian. We provide a baseline method for each task and propose a technique to identify visemes (visual units of speech) in Persian. The visemes obtained by this technique yield a 7% relative improvement in lip reading accuracy over previously proposed visemes, and the technique can be generalized to other languages as well.
computer science, artificial intelligence; engineering, electrical & electronic; operations research & management science