PhonHuBERT: A Phoneme Transcription Tool for Song Datasets

Amaury Prat, Runxuan Yang, Xiaolin Hu
DOI: https://doi.org/10.1007/978-981-97-4399-5_12
2024-01-01
Abstract: In recent years, deep learning has gradually replaced traditional architectures based on mathematical inference, such as Hidden Markov Chains, in Singing Voice Synthesis (SVS) systems, which has increased the demand for accurately labeled data. In response to this need, this work introduces PhonHuBERT, an Aligned Phoneme Sequence Transcription (APST) model for the automatic annotation of song datasets. The model uses HuBERT, a pre-trained self-supervised model for voice feature classification, as an encoder, combined with Bidirectional Long Short-Term Memory (BLSTM) networks as a decoder.
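To make the idea of an aligned phoneme transcription concrete, the following minimal sketch shows one way per-frame phoneme predictions could be turned into timed segments. It is not taken from the paper: it assumes the decoder emits one label per encoder frame, and that frames are spaced 20 ms apart (HuBERT's frame rate for 16 kHz audio). The function name `frames_to_segments`, the constant `FRAME_HOP_S`, and the example labels are all hypothetical.

```python
from itertools import groupby

# Assumption: one encoder frame per 20 ms of audio (HuBERT at 16 kHz).
FRAME_HOP_S = 0.02


def frames_to_segments(frame_labels, hop_s=FRAME_HOP_S):
    """Merge runs of identical per-frame labels into (phoneme, start_s, end_s)."""
    segments = []
    index = 0  # index of the first frame in the current run
    for phoneme, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        start = round(index * hop_s, 3)
        end = round((index + length) * hop_s, 3)
        segments.append((phoneme, start, end))
        index += length
    return segments


# Hypothetical decoder output for a short sung syllable ("sil" = silence).
example_frames = ["sil", "sil", "l", "l", "l", "a", "a", "a", "a", "sil"]
example_segments = frames_to_segments(example_frames)
```

Each resulting tuple gives a phoneme with its start and end time in seconds, which is the kind of annotation an SVS training set typically needs. The actual post-processing used by PhonHuBERT may differ.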