Abstract: This thesis develops a Transformer model based on Whisper that extracts melodies and chords from music audio and transcribes them into ABC notation. A data processing workflow is customized for ABC notation, covering data cleansing, formatting, and conversion, and a mutation mechanism is implemented to increase the diversity and quality of the training data. The thesis introduces "Orpheus' Score", a custom notation system that converts music information into tokens, together with a custom vocabulary and a tokenizer trained on it. Experiments show that the model significantly improves accuracy and performance compared with traditional algorithms. Besides giving music enthusiasts a convenient audio-to-score tool, this work offers new ideas and tools for research in music information processing.

What problem does this paper attempt to address?

This paper addresses the automatic conversion of music audio into musical scores, specifically converting music audio files into scores written in ABC notation. The main existing challenges are:

1. **Difficulty in data acquisition**: Annotating music data is costly and the process is complex.
2. **Poor generalization ability of models**: Existing models have difficulty adapting to different styles of music.
3. **Incomplete information extraction**: Most algorithms extract only one of rhythm, main melody, or chords, and struggle to extract melody and chords simultaneously.

To address these problems, the paper proposes a model based on the Transformer architecture that builds on the pre-trained Whisper model and introduces a custom "Orpheus' Score" notation method, aiming to improve the accuracy and performance of music audio-to-score conversion.

### Specific problem description

- **Deficiencies in music transcription technology**: Professionals can identify melodies and chords easily through practice and experience, but for music lovers and beginners, manual transcription is both expensive and inefficient. An automated audio-to-score tool is therefore valuable for music lovers and for music education.
- **Limitations of existing methods**:
  - Data is difficult to acquire and annotation is costly.
  - Models generalize poorly and have difficulty adapting to different styles of music.
  - Existing algorithms usually extract only one of rhythm, main melody, or chords, and struggle to process melody and chords simultaneously.

### Solutions

The methods proposed in the paper include the following:

1. **Data processing workflow**: A set of data processing procedures has been customized, including data cleaning, formatting, and conversion, to ensure high-quality input data.
2. **Mutation mechanism**: A mutation mechanism is introduced to increase the diversity and quality of the training data.
3. **Custom notation "Orpheus' Score"**: A new notation method has been designed that holds melody and chord information at the same time, and a custom tokenizer has been trained for it.
4. **Model architecture**: The model is based on the Transformer architecture and adapts the pre-trained Whisper model to the task of music audio-to-score conversion.

With these contributions, the paper reports a significant improvement in accuracy and performance over traditional algorithms, providing new ideas and tools for music information processing research.
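To make the idea of a notation that carries melody and chords in one token stream more concrete, here is a minimal sketch in Python. It is not the paper's "Orpheus' Score" format or its trained tokenizer, whose details are not given here. It assumes a small subset of standard ABC notation (quoted chord symbols, bar lines, notes and rests with optional accidentals, octave marks, and lengths), and the names `TOKEN_RE`, `tokenize_abc`, `build_vocab`, and `encode` are hypothetical.

```python
import re

# Hypothetical token pattern for a small subset of ABC notation (an assumption,
# not the paper's vocabulary). Alternatives are tried left to right, so quoted
# chord symbols and multi-character bar lines are listed before single bars.
TOKEN_RE = re.compile(
    r'"[^"]*"'                               # chord symbol in quotes, e.g. "Am"
    r"|\|\]|\|\||\|:|:\||\|"                 # bar lines: |] || |: :| |
    r"|[\^_=]*[A-Ga-gz][,']*\d*(?:/\d*)?"    # note or rest: accidental, pitch, octave, length
)

SPECIAL_TOKENS = ["<pad>", "<bos>", "<eos>"]  # illustrative special tokens


def tokenize_abc(body: str) -> list[str]:
    """Split an ABC tune body into chord, bar-line, and note/rest tokens."""
    return TOKEN_RE.findall(body)


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Assign integer ids: special tokens first, then tokens in order of first appearance."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Map tokens to ids, wrapped in begin/end-of-sequence markers."""
    return [vocab["<bos>"]] + [vocab[t] for t in tokens] + [vocab["<eos>"]]


# Two bars with chord symbols above the melody (made-up example tune).
tune_body = '"C"C2 E2 G2 c2|"G"B,2 D2 G2 z2|]'
tokens = tokenize_abc(tune_body)
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
```

In this sketch chords and melody notes share one vocabulary and one sequence, which is the property the summary attributes to the custom notation. A real system would also need to handle header fields, ties, tuplets, and other ABC constructs that the pattern above ignores.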

Audio-to-Score Conversion Model Based on Whisper methodology
