Audio-to-Score Conversion Model Based on Whisper methodology

Hongyao Zhang, Bohang Sun
2024-10-23
Abstract: This thesis develops a Whisper-based Transformer model that extracts melodies and chords from music audio and transcribes them into ABC notation. A comprehensive data processing workflow is customized for ABC notation, including data cleansing, formatting, and conversion, and a mutation mechanism is implemented to increase the diversity and quality of the training data. The thesis introduces "Orpheus' Score", a custom notation system that converts music information into tokens, designs a custom vocabulary, and trains a corresponding custom tokenizer. Experiments show that, compared to traditional algorithms, the model significantly improves accuracy and performance. While providing a convenient audio-to-score tool for music enthusiasts, this work also offers new ideas and tools for research in music information processing.
Sound, Computation and Language, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the automatic conversion of music audio into musical scores, specifically into scores written in ABC notation. The main existing challenges are:

1. **Difficulty in data acquisition**: Annotating music data is costly and the process is complex.
2. **Poor generalization ability of models**: Existing models have difficulty adapting to different styles of music.
3. **Incomplete information extraction**: Most algorithms extract only one of rhythm, main melody, or chords, and extracting melody and chords simultaneously is difficult.

To solve these problems, the paper proposes a model based on the Transformer architecture that builds on the pre-trained Whisper model and introduces a custom "Orpheus' Score" notation, aiming to improve the accuracy and performance of music audio-to-score conversion.

### Specific problem description

- **Deficiencies in music transcription technology**: Professionals can identify melodies and chords through practice and experience, but for music lovers and beginners, manual transcription is both expensive and inefficient. An automated audio-to-score tool is therefore valuable for music lovers and for music education.
- **Limitations of existing methods**:
  - Data is difficult to acquire and annotation is costly.
  - Models generalize poorly and have difficulty adapting to different styles of music.
  - Existing algorithms can usually extract only one of rhythm, main melody, or chords, and struggle to process melody and chords simultaneously.

### Solutions

The methods proposed in the paper mainly include the following:

1. **Data processing workflow**: A comprehensive set of data processing procedures is customized for ABC notation, including data cleaning, formatting, and conversion, to ensure high-quality input data.
2. **Mutation mechanism**: A mutation mechanism is introduced to increase the diversity and quality of the training data.
3. **Custom notation "Orpheus' Score"**: A new notation is designed that contains both melody and chord information, and a custom tokenizer is trained for it.
4. **Model architecture**: The model is based on the Transformer architecture and adapts the pre-trained Whisper model to the task of music audio-to-score conversion.

With these innovations, the paper reports a significant improvement in accuracy and performance over traditional algorithms, providing new ideas and tools for music information processing research.
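
To make concrete what "melody and chord information" looks like in the ABC notation the model targets, the following minimal Python sketch separates the chord symbols from the melody in a short tune. It relies only on the ABC convention that chord symbols are written in double quotes before a note. The tune, the function name `split_melody_and_chords`, and the regex-based parsing are assumptions made for illustration; this is not the paper's data processing workflow, its "Orpheus' Score" notation, or its tokenizer.

```python
import re

# Hypothetical example tune, written for this sketch (not from the paper's data).
ABC_TUNE = '''X:1
T:Example
M:4/4
L:1/4
K:C
"C"C E G E | "F"F A c A | "G"G B d B | "C"c4 |]
'''

# In ABC notation, chord symbols are enclosed in double quotes.
CHORD_RE = re.compile(r'"([^"]*)"')


def split_melody_and_chords(abc: str):
    """Return (melody, chords) extracted from the body of an ABC tune.

    Simplification: any line starting with "<letter>:" is treated as a
    header field and skipped; quoted text annotations are not distinguished
    from chord symbols.
    """
    body_lines = [
        line for line in abc.splitlines()
        if line.strip() and not re.match(r'^[A-Za-z]:', line)
    ]
    body = " ".join(body_lines)
    chords = CHORD_RE.findall(body)
    # Remove the quoted chord symbols and normalize whitespace.
    melody = re.sub(r'\s+', ' ', CHORD_RE.sub('', body)).strip()
    return melody, chords


melody, chords = split_melody_and_chords(ABC_TUNE)
print(melody)  # melody notes and bar lines only
print(chords)  # chord symbols in order of appearance
```

A real pipeline would need a proper ABC parser to handle repeats, multiple voices, inline fields, and annotations; the sketch only shows that both kinds of information coexist in a single ABC string.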