Abstract: Following the success of the transformer architecture in the natural language domain, transformer-like architectures have recently been widely applied to symbolic music. Symbolic music and text, however, are two different modalities. Symbolic music contains multiple attributes, both absolute (e.g., pitch) and relative (e.g., pitch interval). The relative attributes shape human perception of musical motifs, yet they are mostly ignored in existing symbolic music modelling methods, mainly because of the lack of a musically meaningful embedding space in which both the absolute and relative embeddings of symbolic music tokens can be efficiently represented. In this paper, we propose the Fundamental Music Embedding (FME) for symbolic music, based on a bias-adjusted sinusoidal encoding within which both the absolute and the relative attributes can be embedded and the fundamental musical properties (e.g., translational invariance) are explicitly preserved. Taking advantage of the proposed FME, we further propose a novel attention mechanism based on the relative index, pitch and onset embeddings (RIPO attention), such that musical domain knowledge can be fully utilized for symbolic music modelling. Experimental results show that our proposed model, the RIPO transformer, which uses FME and RIPO attention, outperforms the state-of-the-art transformers (i.e., music transformer, linear transformer) in a melody completion task. Moreover, when the RIPO transformer is used in a downstream music generation task, we observe that the notorious degeneration phenomenon no longer occurs, and the music generated by the RIPO transformer outperforms the music generated by state-of-the-art transformer models in both subjective and objective evaluations. The code of the proposed method is available online: github.com/guozixunnicolas/FundamentalMusicEmbedding.
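The translational invariance mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact FME formulation, so the code below assumes the standard transformer sinusoidal form with a constant bias vector added. The names `sinusoidal_embedding` and `distance`, the dimension, the base and the bias values are all hypothetical and are not taken from the authors' repository. Under these assumptions, the distance between two pitch embeddings depends only on the pitch interval, not on the absolute pitches.

```python
import math


def sinusoidal_embedding(value, dim=8, base=10000.0, bias=None):
    """Embed a scalar attribute (e.g. a MIDI pitch) as sin/cos pairs plus a bias.

    Assumed form, not the paper's exact definition.
    """
    if bias is None:
        bias = [0.0] * dim
    out = []
    for k in range(dim // 2):
        w = base ** (-2.0 * k / dim)  # frequency of the k-th sin/cos pair
        out.append(math.sin(w * value) + bias[2 * k])
        out.append(math.cos(w * value) + bias[2 * k + 1])
    return out


def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical constant bias, chosen only for illustration.
bias = [0.1 * i for i in range(8)]

# Two major thirds (4 semitones each) starting from different pitches.
d_c_to_e = distance(sinusoidal_embedding(60, bias=bias),
                    sinusoidal_embedding(64, bias=bias))
d_g_to_b = distance(sinusoidal_embedding(67, bias=bias),
                    sinusoidal_embedding(71, bias=bias))

# A different interval (5 semitones) for comparison.
d_c_to_f = distance(sinusoidal_embedding(60, bias=bias),
                    sinusoidal_embedding(65, bias=bias))

print(d_c_to_e, d_g_to_b, d_c_to_f)
```

In this sketch the two major thirds yield the same distance up to floating-point error, because each sin/cos pair contributes 2 - 2cos(w(a - b)) to the squared distance and the bias cancels in the difference. The 5-semitone interval yields a different distance.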

The Power of Fragmentation: A Hierarchical Transformer Model for Structural Segmentation in Symbolic Music Generation

Museformer: Transformer with Fine- and Coarse-Grained Attention for Music Generation

Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings

Whole-Song Hierarchical Generation of Symbolic Music Using Cascaded Diffusion Models

Small Tunes Transformer: Exploring Macro & Micro-Level Hierarchies for Skeleton-Conditioned Melody Generation

A Multi-Scale Attentive Transformer for Multi-Instrument Symbolic Music Generation

Hyperbolic Music Transformer for Structured Music Generation

PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation

Structure-informed Positional Encoding for Music Generation

A Domain-Knowledge-Inspired Music Embedding Space and a Novel Attention Mechanism for Symbolic Music Modeling

Coordinate Embedding Transformer Model for Optical Music Recognition on Monophonic Scores

Do we need more complex representations for structure? A comparison of note duration representation for Music Transformers

N-Gram Unsupervised Compoundation and Feature Injection for Better Symbolic Music Understanding

Continuous Melody Generation via Disentangled Short-Term Representations and Structural Conditions

Hierarchical Symbolic Pop Music Generation with Graph Neural Networks

MIDI-Sandwich: Multi-model Multi-task Hierarchical Conditional VAE-GAN networks for Symbolic Single-track Music Generation

MuPT: A Generative Symbolic Music Pretrained Transformer

Symbolic Music Generation with Transformer-GANs

MeloTrans: A Text to Symbolic Music Generation Model Following Human Composition Habit

Symphony Generation with Permutation Invariant Language Model

MELONS: generating melody with long-term structure using transformers and structure graph