TheGlueNote: Learned Representations for Robust and Flexible Note Alignment

Silvan David Peter, Gerhard Widmer
2024-08-08
Abstract: Note alignment refers to the task of matching individual notes of two versions of the same symbolically encoded piece. Methods addressing this task commonly rely on sequence alignment algorithms such as Hidden Markov Models or Dynamic Time Warping (DTW) applied directly to note or onset sequences. While successful in many cases, such methods struggle with large mismatches between the versions. In this work, we learn note-wise representations from data augmented with various complex mismatch cases, e.g. repeats, skips, block insertions, and long trills. At the heart of our approach lies a transformer encoder network - TheGlueNote - which predicts pairwise note similarities for two 512 note subsequences. We postprocess the predicted similarities using flavors of weightedDTW and pitch-separated onsetDTW to retrieve note matches for two sequences of arbitrary length. Our approach performs on par with the state of the art in terms of note alignment accuracy, is considerably more robust to version mismatches, and works directly on any pair of MIDI files.
Sound, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses **note alignment**: matching the individual notes of two symbolically encoded versions of the same musical piece. Specifically:

1. **Limitations of existing methods**:
   - Existing note alignment methods usually rely on sequence alignment algorithms, such as Hidden Markov Models (HMMs) or Dynamic Time Warping (DTW), applied directly to note or onset sequences.
   - These methods struggle when the two versions differ substantially, for example through repeats, skips, block insertions, and long trills.
2. **Research objectives**:
   - Improve the robustness and flexibility of note alignment by learning note-wise representations.
   - The core of the method is a Transformer encoder network named **TheGlueNote**, which predicts pairwise note similarities for two 512-note subsequences.
   - The predicted similarities are postprocessed with variants of weightedDTW and pitch-separated onsetDTW to retrieve note matches for two sequences of arbitrary length.
3. **Innovations**:
   - **Use of non-local information**: Through the Transformer encoder, the entire note sequence influences the representation of each note, which helps with complex mismatch cases.
   - **Data augmentation**: The model is trained on data augmented with a variety of complex mismatch cases, so that its similarity predictions remain reliable when the versions diverge.
   - **No need for additional annotations**: Unlike traditional methods, the model processes MIDI files directly and does not require quantized music, score annotations, or other attributes.
4. **Performance**:
   - In note alignment accuracy, the method is on par with the state of the art, and it is considerably more robust to version mismatches.
   - It can be applied directly to any pair of MIDI files.
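The postprocessing step described above, turning a pairwise similarity matrix into note matches, can be illustrated with a minimal sketch. This is plain DTW, not the paper's weightedDTW or pitch-separated onsetDTW, and everything in it is an assumption made for illustration: the function names `dtw_path` and `note_matches`, the `threshold` parameter, and the toy similarity matrix, which simply marks equal pitches with 1 and stands in for the network's learned similarities.

```python
import numpy as np


def dtw_path(sim):
    """Return a monotonic alignment path through a pairwise similarity matrix."""
    cost = 1.0 - np.asarray(sim, dtype=float)  # high similarity means low cost
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)  # accumulated cost with a padded border
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    # Backtrack from the last cell to the first; ties favor the diagonal step.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
        path.append((i - 1, j - 1))
    return path[::-1]


def note_matches(sim, threshold=0.5):
    """Keep only the path cells whose similarity reaches the (assumed) threshold."""
    sim = np.asarray(sim, dtype=float)
    return [(i, j) for i, j in dtw_path(sim) if sim[i, j] >= threshold]


# Toy stand-in for learned similarities: 1 where the pitches agree, else 0.
version_a = np.array([60, 62, 64, 65])
version_b = np.array([60, 62, 63, 64, 65])  # contains one inserted note (63)
toy_similarity = (version_a[:, None] == version_b[None, :]).astype(float)

matches = note_matches(toy_similarity)
print(matches)
```

In this toy case the inserted note in the second version stays unmatched, while the remaining notes are paired in order. A learned similarity matrix would replace `toy_similarity`, and longer pieces would need the windowing over 512-note subsequences that the summary mentions, which this sketch does not attempt.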
In summary, by combining learned note representations with data augmentation, the paper aims to overcome the difficulty that existing note alignment methods have with complex mismatches between versions, and so to provide a more robust and flexible note alignment method.
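The augmentation idea can also be sketched in a few lines: a mismatch is applied to a note sequence while recording which original note each resulting note came from, which yields ground-truth matches for training. The sketch below works on bare pitch lists and covers only skips and repeats. The function names, the choice to label both passes of a repeat as matches, and the omission of timing, velocity, block insertions, and trills are all simplifying assumptions, not the paper's augmentation pipeline.

```python
def skip_block(pitches, start, length):
    """Drop pitches[start:start+length], as if a passage had been skipped."""
    kept = [i for i in range(len(pitches)) if not start <= i < start + length]
    augmented = [pitches[i] for i in kept]
    # Ground truth: (index in original, index in augmented) for each surviving note.
    matches = [(orig, new) for new, orig in enumerate(kept)]
    return augmented, matches


def repeat_block(pitches, start, length):
    """Play pitches[start:start+length] twice, as in an extra repeat."""
    end = start + length
    order = list(range(end)) + list(range(start, len(pitches)))
    augmented = [pitches[i] for i in order]
    # Assumption: both passes are labeled, so a repeated note has two matches.
    matches = [(orig, new) for new, orig in enumerate(order)]
    return augmented, matches


melody = [60, 62, 64, 65, 67]
skipped, skip_matches = skip_block(melody, start=1, length=2)
repeated, repeat_matches = repeat_block(melody, start=1, length=2)
print(skipped, skip_matches)
print(repeated, repeat_matches)
```

Pairs of this kind, an original sequence, a perturbed sequence, and the known matches between them, are the sort of supervision a pairwise similarity model could be trained on.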