An Investigation of Fundamental Frequency Pattern Prediction for Japanese Electrolaryngeal Speech Enhancement Based on Frame-Wise Phoneme Representations
Mohammad Eshghi, Tomoki Toda
DOI: https://doi.org/10.1109/access.2024.3384973
IF: 3.9
2024-04-12
IEEE Access
Abstract: Total laryngectomy (TL) is a well-established treatment for advanced laryngeal malignancies, entailing the complete removal of the larynx. Speech rehabilitation following TL is crucial for improving quality of life and facilitating social reintegration. Electrolaryngeal (EL) speech, a widely used voice restoration technique that relies on external excitation signals, often sounds artificial and monotonous. Efforts to enhance EL speech include statistical voice conversion and neural approaches to speech enhancement. These approaches typically aim to map spectral features into acoustic characteristics, including the fundamental frequency (F0). However, challenges arise owing to substantial discrepancies and pattern differences between the features extracted from EL and normal speech, compounded by limited clinical training data. To address this issue, we explored F0 pattern prediction based on frame-wise phoneme information using bidirectional long short-term memory recurrent neural networks. Beyond direct predictions based on phoneme labels, we expanded our analysis to include real-valued phoneme embeddings and conducted predictions for clustered embeddings that serve as low-dimensional input representations. Our findings demonstrate that both regression and classification predictive modeling can map frame-wise phoneme information into natural F0 patterns. Additionally, phoneme labels can be treated as features shared between EL and normal speech, so prediction accuracy improves when phoneme information from normal speech is incorporated into the training sets for EL speech. Furthermore, by learning phoneme embeddings and building input features for prediction from clusters of these embeddings, accurate F0 patterns can be predicted, and the challenge of finding a strategy to reduce the dimensionality of the input features is effectively alleviated.
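As a rough illustration of what a frame-wise phoneme input of the kind mentioned in the abstract could look like, the sketch below expands phoneme segments into one label per frame and one-hot encodes them. The phoneme inventory, the 5 ms frame shift, the segment timings, and the helper names `frame_wise_labels` and `one_hot` are all assumptions made for this example; none of them are taken from the paper, and the paper's actual feature pipeline may differ.

```python
from typing import List, Tuple

# Hypothetical phoneme inventory and frame shift (assumptions, not from the paper).
PHONEMES = ["sil", "a", "i", "k", "o"]
FRAME_SHIFT_MS = 5.0


def frame_wise_labels(
    segments: List[Tuple[str, float, float]],
    frame_shift_ms: float = FRAME_SHIFT_MS,
) -> List[str]:
    """Expand (phoneme, start_ms, end_ms) segments into one label per frame."""
    labels: List[str] = []
    for phoneme, start_ms, end_ms in segments:
        # Number of frames covered by this segment at the given frame shift.
        n_frames = round((end_ms - start_ms) / frame_shift_ms)
        labels.extend([phoneme] * n_frames)
    return labels


def one_hot(labels: List[str], inventory: List[str] = PHONEMES) -> List[List[float]]:
    """Encode each frame label as a one-hot vector over the inventory."""
    index = {p: i for i, p in enumerate(inventory)}
    return [
        [1.0 if index[label] == j else 0.0 for j in range(len(inventory))]
        for label in labels
    ]


# Illustrative alignment for a short utterance (made-up timings).
segments = [("sil", 0.0, 10.0), ("k", 10.0, 25.0), ("a", 25.0, 45.0)]
labels = frame_wise_labels(segments)
features = one_hot(labels)
```

A sequence such as `features` is the sort of per-frame input that a sequence model (for example a bidirectional LSTM) could consume to predict one F0 value or F0 class per frame.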
computer science, information systems; telecommunications; engineering, electrical & electronic