A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons

Tzu-Yun Hung, Jui-Te Wu, Yu-Chia Kuo, Yo-Wei Hsiao, Ting-Wei Lin, Li Su
2024-06-26
Abstract: Expressive music synthesis (EMS) for violin performance is a challenging task due to the disagreement among music performers in the interpretation of expressive musical terms (EMTs), scarcity of labeled recordings, and limited generalization ability of the synthesis model. These challenges create trade-offs between model effectiveness, diversity of generated results, and controllability of the synthesis system, making it essential to conduct a comparative study on EMS model design. This paper explores two violin EMS approaches. The end-to-end approach is a modification of a state-of-the-art text-to-speech generator. The parameter-controlled approach is based on a simple parameter sampling process that can render note lengths and other parameters compatible with MIDI-DDSP. We study these two approaches (in total, three model variants) through objective and subjective experiments and discuss several key issues of EMS based on the results.
Sound, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the challenges of expressive music synthesis (EMS) for violin performance. Specifically, the EMS task faces the following main problems:

1. **Subjectivity and diversity in the interpretation of expressive musical terms (EMTs)**:
   - Performers and listeners interpret EMTs differently, so there is no single standard for a model to learn.
   - The definition and application of an EMT depend heavily on context and personal understanding.
2. **Scarcity of labeled data**:
   - Very few audio recordings carry EMT labels, which limits the amount and diversity of training data.
   - Even the largest existing dataset (such as SCREAM-MAC-EMT) contains only 10 short tracks and lacks sufficient diversity.
3. **Generalization ability and controllability of the model**:
   - An EMS model needs to balance effectiveness, diversity of the generated results, and controllability of the system.
   - The model should let users edit the output, add grace notes to specific notes, and work with different input formats (such as MIDI and MusicXML).
4. **Feasibility of EMT-conditioned generation**:
   - The paper investigates how to generate performances that convey the expression indicated by a specified EMT, with the aim that the results are comparable to or better than human performances.

### Solutions

To address these challenges, the paper explores two main EMS methods:

1. **End-to-End Model**:
   - Modifies an existing state-of-the-art text-to-speech generator (such as StyleSpeech) to generate the final audio directly from MIDI and EMT inputs.
   - Does not rely on low-level features as controllable inputs; instead, musical style and expression are controlled through high-level embedding layers.
2. **Parameter-Controlled Model**:
   - Uses a simple parameter sampling process to adjust note lengths and other parameters so that they fit the MIDI-DDSP framework.
   - Comes in two variants: one takes MIDI input, and the other takes MusicXML input (which supports adding grace note symbols).

Through objective experiments and subjective evaluations, the paper compares these two methods and their variants and discusses the key issues in the EMS task. The results show that the models perform differently on different EMTs, which underlines the complexity and multi-dimensional nature of the EMS task.
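The summary describes the parameter-controlled approach only as "a simple parameter sampling process" over note lengths and other parameters. As a rough illustration of that idea, and not the authors' actual procedure, the sketch below draws a length-scaling factor for each note from a range tied to an expression label. The `Note` class, the label names, and every numeric range are hypothetical placeholders chosen for this example; none of them come from the paper or from MIDI-DDSP.

```python
import random
from dataclasses import dataclass


@dataclass
class Note:
    pitch: int       # MIDI pitch number
    onset: float     # onset time in seconds
    duration: float  # notated duration in seconds


# Hypothetical per-label ranges for a note-length scaling factor.
# Placeholder values for illustration only, not taken from the paper.
LENGTH_SCALE_RANGE = {
    "short_detached": (0.4, 0.6),
    "smooth_sustained": (0.9, 1.0),
}


def sample_note_lengths(notes, label, rng):
    """Return new notes whose durations are scaled by a factor sampled
    uniformly from the range associated with `label`."""
    low, high = LENGTH_SCALE_RANGE[label]
    return [
        Note(n.pitch, n.onset, n.duration * rng.uniform(low, high))
        for n in notes
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs match
    melody = [Note(69, 0.0, 1.0), Note(71, 1.0, 1.0), Note(72, 2.0, 1.0)]
    for note in sample_note_lengths(melody, "short_detached", rng):
        print(note)
```

Because each call samples fresh factors, repeated renderings of the same score differ slightly, which is one simple way a sampling-based design can trade some controllability for diversity of output.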