Optimizing medical personnel speech recognition models using speech synthesis and reinforcement learning

Andrzej Czyzewski
DOI: https://doi.org/10.1121/10.0023271
2023-10-01
The Journal of the Acoustical Society of America
Abstract: Text-to-Speech (TTS) synthesis can be used to generate training data for building Automatic Speech Recognition (ASR) models. Access to medical speech data is limited because it is sensitive and difficult to obtain for privacy reasons. Speech can be synthesized to mimic different accents, dialects, and speaking styles in medical language. Reinforcement Learning (RL), in the context of ASR, can be used to optimize a model: the model can be trained to minimize errors in speech-to-text transcription, especially for technical medical terminology. In this case, the “reward” to the RL model can be negatively proportional to the number of transcription errors. The paper presents a method and an experimental study from which it is concluded that combining TTS and RL can enable the creation of a speech recognition model suited to the specific needs of medical personnel, helping to expand the training data and to optimize the model to minimize transcription errors. The learning process used reward functions based on Mean Opinion Score (MOS), a subjective metric for assessing speech quality, and Word Error Rate (WER), which evaluates the quality of speech-to-text transcription. [The Polish National Center for Research and Development (NCBR) supported the project: “ADMEDVOICE - Adaptive intelligent speech processing system of medical personnel with the structuring of test results and support of therapeutic process,” no. INFOSTRATEG4/0003/2022.]
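To make the WER-based reward idea concrete, the following is a minimal sketch of how a reward that is negatively proportional to the number of transcription errors could be computed. It is not the author's implementation: the function names `word_errors`, `wer`, and `reward`, the `scale` parameter, and the whitespace tokenization are all assumptions chosen for illustration. Errors are counted as the word-level Levenshtein distance (substitutions, deletions, and insertions), which is the standard basis for WER.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level edit distance between a reference and an ASR hypothesis.

    Hypothetical helper: tokenization is plain whitespace splitting.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word errors divided by the reference length."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return word_errors(reference, hypothesis) / n_ref


def reward(reference: str, hypothesis: str, scale: float = 1.0) -> float:
    """Reward negatively proportional to the number of transcription errors.

    `scale` is an assumed proportionality constant, not a value from the paper.
    """
    return -scale * word_errors(reference, hypothesis)


if __name__ == "__main__":
    ref = "patient has acute myocardial infarction"
    hyp = "patient has a cute myocardial infarction"
    print(word_errors(ref, hyp), wer(ref, hyp), reward(ref, hyp))
```

In this example the hypothesis splits "acute" into "a cute", which counts as one substitution plus one insertion. A MOS-based reward term, as mentioned in the abstract, would need a separate speech-quality score and is not covered by this sketch.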
acoustics, audiology & speech-language pathology