Abstract: In recent years, artificial neural networks (ANNs) have become a universal tool for tackling real-world problems. ANNs have also shown great success in music-related tasks, including music summarization and classification, similarity estimation, computer-aided or autonomous composition, and automatic music analysis. As structure is a fundamental characteristic of Western music, it plays a role in all these tasks. Some structural aspects are particularly challenging to learn with current ANN architectures. This is especially true for mid- and high-level self-similarity and for tonal and rhythmic relationships. In this thesis, I explore the application of ANNs to different aspects of musical structure modeling, identify some of the challenges involved, and propose strategies to address them. First, a probabilistic bottom-up approach to melody segmentation is studied, using probability estimates from a Restricted Boltzmann Machine (RBM). Then, a top-down method for imposing a high-level structural template in music generation is presented, which combines Gibbs sampling using a convolutional RBM with gradient-descent optimization on the intermediate solutions. Furthermore, I motivate the relevance of musical transformations in structure modeling and show how a connectionist model, the Gated Autoencoder (GAE), can be employed to learn transformations between musical fragments. For learning transformations in sequences, I propose a special predictive training of the GAE, which yields a representation of polyphonic music as a sequence of intervals. I also show that these interval representations can be applied to a top-down discovery of repeated musical sections. Finally, a recurrent variant of the GAE is proposed, and its efficacy in music prediction and in modeling low-level repetition structure is demonstrated.
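
To make the idea of an interval representation concrete, the following is a minimal sketch, not the method of the thesis: it computes intervals by hand for a monophonic melody rather than learning them with a GAE from polyphonic music. It assumes pitches are given as MIDI note numbers; the helper names `intervals` and `find_repeats` are hypothetical and chosen only for illustration. The sketch shows why intervals are useful for finding repeated sections: a fragment and its transposition share the same interval pattern.

```python
def intervals(pitches):
    """Successive pitch differences in semitones (MIDI numbers assumed)."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def find_repeats(pitches, length):
    """Return (i, j) start-index pairs whose `length`-note fragments share
    the same interval pattern, i.e. repeat exactly or under transposition."""
    if length < 2:
        raise ValueError("a fragment needs at least two notes")
    ivs = intervals(pitches)
    n = length - 1  # a fragment of `length` notes spans length - 1 intervals
    seen = {}       # interval pattern -> earlier start indices
    pairs = []
    for start in range(len(ivs) - n + 1):
        key = tuple(ivs[start:start + n])
        for earlier in seen.get(key, []):
            pairs.append((earlier, start))
        seen.setdefault(key, []).append(start)
    return pairs


# Illustrative data only: a four-note motif, one linking note,
# then the same motif transposed up five semitones.
motif = [60, 62, 64, 60]
melody = motif + [67] + [p + 5 for p in motif]
print(find_repeats(melody, 4))  # prints [(0, 5)]
```

The transposed motif is found at note index 5 even though none of its pitches match the original, because the comparison is made on intervals rather than on absolute pitch.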