Abstract: Note alignment refers to the task of matching individual notes of two versions of the same symbolically encoded piece. Methods addressing this task commonly rely on sequence alignment algorithms such as Hidden Markov Models or Dynamic Time Warping (DTW) applied directly to note or onset sequences. While successful in many cases, such methods struggle with large mismatches between the versions. In this work, we learn note-wise representations from data augmented with various complex mismatch cases, e.g. repeats, skips, block insertions, and long trills. At the heart of our approach lies a transformer encoder network - TheGlueNote - which predicts pairwise note similarities for two 512 note subsequences. We postprocess the predicted similarities using flavors of weightedDTW and pitch-separated onsetDTW to retrieve note matches for two sequences of arbitrary length. Our approach performs on par with the state of the art in terms of note alignment accuracy, is considerably more robust to version mismatches, and works directly on any pair of MIDI files.

### What problem does this paper attempt to solve?

This paper addresses the problem of **note alignment**: matching the individual notes of two symbolically encoded versions of the same musical piece. Specifically:

1. **Limitations of existing methods**:
   - Existing note alignment methods usually rely on sequence alignment algorithms, such as Hidden Markov Models (HMMs) or Dynamic Time Warping (DTW), applied directly to note or onset sequences.
   - These methods struggle with large mismatches between versions, such as repeats, skips, block insertions, and long trills.
2. **Research objectives**:
   - Propose a new method that improves the robustness and flexibility of note alignment by learning note-wise representations.
   - The core of the method is a Transformer encoder network named **TheGlueNote**, which predicts pairwise note similarities between two 512-note subsequences.
   - Post-process the predicted similarities with variants of weighted DTW and pitch-separated onset DTW to retrieve note matches for two sequences of arbitrary length.
3. **Innovations**:
   - **Utilization of non-local information**: Through the Transformer encoder, the entire note sequence influences the representation of each note, which helps the model handle complex mismatches.
   - **Data augmentation**: The model is trained on data augmented with a variety of complex mismatch cases, so that it predicts note similarities more robustly.
   - **No need for additional annotations**: Unlike traditional methods, the model processes MIDI files directly and does not require quantized music, score annotations, or other attributes.
4. **Performance improvement**:
   - In terms of note alignment accuracy, the method is on par with the state of the art, while being considerably more robust to mismatches between versions.
   - It can be applied directly to any pair of MIDI files, which makes it more flexible and widely applicable.

In summary, by combining learned note representations with data augmentation, this paper aims to overcome the limitations of existing note alignment methods on complex mismatches, and thus to provide a more robust and flexible note alignment solution.
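To make the two-stage idea concrete (pairwise note similarities first, then a DTW pass over them), here is a minimal Python sketch. It is not the authors' implementation. The embeddings are hand-made toy vectors standing in for the Transformer's output, the softmax normalization is an assumption, and the step weights of `dtw_path` are hypothetical values. The pitch-separated onset DTW stage is not shown.

```python
import math


def similarity_matrix(emb_a, emb_b):
    """Row-wise softmax over dot products of two lists of note embeddings.

    Assumption: the summary does not say how similarities are normalized;
    a softmax over each row is used here purely for illustration.
    """
    sims = []
    for u in emb_a:
        logits = [sum(x * y for x, y in zip(u, v)) for v in emb_b]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        sims.append([e / total for e in exps])
    return sims


def dtw_path(sim, weights=(1.0, 1.0, 1.0)):
    """Weighted DTW over the cost matrix 1 - similarity.

    weights = (diagonal, vertical, horizontal) step multipliers.
    These are hypothetical defaults, not values taken from the paper.
    """
    n, m = len(sim), len(sim[0])
    w_diag, w_vert, w_horz = weights
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]  # accumulated cost
    acc[0][0] = 0.0
    back = {}  # best predecessor of each cell
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 1.0 - sim[i - 1][j - 1]
            candidates = [
                (acc[i - 1][j - 1] + w_diag * cost, (i - 1, j - 1)),
                (acc[i - 1][j] + w_vert * cost, (i - 1, j)),
                (acc[i][j - 1] + w_horz * cost, (i, j - 1)),
            ]
            acc[i][j], back[(i, j)] = min(candidates)
    # Walk back from the final cell to the origin to recover the path.
    path = []
    i, j = n, m
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        i, j = back[(i, j)]
    path.reverse()
    return path


# Toy example: version B repeats the second note of version A.
version_a = [[5, 0, 0], [0, 5, 0], [0, 0, 5]]
version_b = [[5, 0, 0], [0, 5, 0], [0, 5, 0], [0, 0, 5]]
sim = similarity_matrix(version_a, version_b)
path = dtw_path(sim)
print(path)
```

In this toy case the path pairs the second note of version A with both copies of it in version B. A real system would still have to decide which of the two is the actual match and which is an insertion, which is the kind of decision the post-processing described above is meant to handle.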

TheGlueNote: Learned Representations for Robust and Flexible Note Alignment

Online Symbolic Music Alignment with Offline Reinforcement Learning

Just Label the Repeats for In-The-Wild Audio-to-Score Alignment

MIDI-Sheet Music Alignment Using Bootleg Score Synthesis

Aligned Music Notation and Lyrics Transcription

Unaligned Supervision For Automatic Music Transcription in The Wild

Generative Adversarial Network for Musical Notation Recognition during Music Teaching

Coordinate Embedding Transformer Model for Optical Music Recognition on Monophonic Scores

A Study of Annotation and Alignment Accuracy for Performance Comparison in Complex Orchestral Music

Learning Audio-Sheet Music Correspondences for Score Identification and Offline Alignment

Engraving Oriented Joint Estimation of Pitch Spelling and Local and Global Keys

One TTS Alignment To Rule Them All

Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages

ChordSync: Conformer-Based Alignment of Chord Annotations to Music Audio

Do we need more complex representations for structure? A comparison of note duration representation for Music Transformers

Unsupervised Generative Adversarial Alignment Representation for Sheet music, Audio and Lyrics

Soft Dynamic Time Warping for Multi-Pitch Estimation and Beyond

Improving Lyrics Alignment Through Joint Pitch Detection

Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment

Audio to Score Alignment Based on Chroma Features and Dynamic Time Warping Algorithm

End-to-end Piano Performance-MIDI to Score Conversion with Transformers