Research on the Quality Evaluation of Speech Datasets for Model Training

Sun Li, Feng Cao, Zishan Liu
DOI: https://doi.org/10.1109/icicas53977.2021.00050
2021-12-01
Abstract: As intelligent speech technology and its product applications mature, the demand for high-quality speech datasets is increasing. Some researchers have worked on quality evaluation for structured data, but few standards exist for speech datasets. By analyzing how speech algorithm models are constructed and what that construction demands of speech datasets, this paper presents a unified quality assessment framework for speech datasets. The framework evaluates speech datasets along four dimensions: breadth coverage, anthology distinction, professional field depth, and data integrity. By proposing specific quality evaluation metrics, calculation methods, and evaluation steps for speech datasets, and by analyzing experimental examples and results of speech dataset quality evaluation in the vehicle application field, the paper provides a reference for evaluating speech dataset quality and promoting dataset construction. Considering the diverse applicability, privacy issues, and efficiency and automation requirements of speech dataset construction, it also puts forward suggestions for the future development of high-quality speech dataset construction.