Abstract: Multi-instrument music transcription aims to convert polyphonic music recordings into musical scores assigned to each instrument. This task is challenging for modeling as it requires simultaneously identifying multiple instruments and transcribing their pitch and precise timing, and the lack of fully annotated data adds to the training difficulties. This paper introduces YourMT3+, a suite of models for enhanced multi-instrument music transcription based on the recent language token decoding approach of MT3. We enhance its encoder by adopting a hierarchical attention transformer in the time-frequency domain and integrating a mixture of experts. To address data limitations, we introduce a new multi-channel decoding method for training with incomplete annotations and propose intra- and cross-stem augmentation for dataset mixing. Our experiments demonstrate direct vocal transcription capabilities, eliminating the need for voice separation pre-processors. Benchmarks across ten public datasets show our models' competitiveness with, or superiority to, existing transcription models. Further testing on pop music recordings highlights the limitations of current models. Fully reproducible code and datasets are available with demos at https://github.com/mimbres/YourMT3.

What problem does this paper attempt to address?

This paper attempts to solve several key problems in multi-instrument music transcription:

1. **Simultaneously identifying and transcribing multiple instruments**: Multi-instrument music transcription converts polyphonic music recordings into a musical score for each instrument, which requires identifying the instruments and transcribing their pitch and precise timing at the same time. This is challenging because the model must handle complex audio signals and accurately separate the sounds of individual instruments.
2. **Insufficient data annotation**: Fully annotated datasets are scarce, which makes training harder. Many existing datasets are only partially annotated, or omit certain instruments entirely, so making effective use of incompletely annotated data is a pressing problem.
3. **Transcribing vocals directly, without pre-processing**: The singing voice is a particularly important component of multi-instrument music, but traditional methods usually separate the vocals first and transcribe them afterwards. This paper proposes a method that transcribes vocals directly from the mixed audio, with no separate vocal separation step.
4. **Improving the performance and generalization of existing models**: Despite recent progress, existing multi-instrument transcription models still have limitations on real-world music. For example, they perform poorly on commercial pop music recordings, especially on non-mainstream instruments. Improving model performance on such complex audio is therefore also an important issue.

### Specific solutions

To address the above challenges, the authors propose YourMT3+, an improved model based on the Transformer architecture, with the following main innovations:

- **Enhanced encoder**: A hierarchical attention transformer and a mixture of experts (MoE) are adopted to better capture features in the time-frequency domain.
- **Multi-channel decoder**: A multi-channel decoder is introduced, which lets the model train on partially annotated data through task queries and improves its flexibility and robustness.
- **Data augmentation techniques**: Online data augmentation strategies such as cross-dataset stem augmentation and pitch-shifting are proposed to increase the diversity of training samples and thereby improve generalization.

With these improvements, YourMT3+ matches or outperforms existing transcription models on multiple public datasets and can transcribe vocals directly, without prior vocal separation. The experimental results also show that the model performs well on synthetic datasets but still has limitations on commercial pop music recordings, suggesting that future research may need more diverse training data and more refined pitch processing methods.
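
To make the idea of cross-dataset stem augmentation more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the `Stem` structure, the note tuple format, the function names, and the peak-rescaling rule are all assumptions made for illustration. The sketch only shows the general principle of drawing isolated instrument stems from a pool that spans several datasets, summing their audio, and merging their note labels into one new training example.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical note format: (onset_seconds, offset_seconds, midi_pitch, instrument)
Note = Tuple[float, float, int, str]


@dataclass
class Stem:
    """One isolated instrument track plus its note labels (illustrative structure)."""
    dataset: str
    audio: List[float]  # mono samples; all stems are assumed to share a sample rate
    notes: List[Note]


def mix_stems(stems: List[Stem]) -> Tuple[List[float], List[Note]]:
    """Sum the stem waveforms and merge their note labels into one example."""
    length = max(len(s.audio) for s in stems)
    mix = [0.0] * length
    for s in stems:
        for i, x in enumerate(s.audio):
            mix[i] += x
    # Assumed rule: rescale only when the summed signal would leave [-1, 1].
    peak = max((abs(x) for x in mix), default=0.0)
    if peak > 1.0:
        mix = [x / peak for x in mix]
    # Merge labels from all stems, ordered by onset time.
    notes = sorted(n for s in stems for n in s.notes)
    return mix, notes


def cross_stem_example(pool: List[Stem], k: int, rng: random.Random):
    """Draw k distinct stems from the pooled datasets, regardless of their origin."""
    return mix_stems(rng.sample(pool, k))


# Usage with toy data: three stems taken from two made-up datasets.
pool = [
    Stem("dataset_a", [0.5, 0.5, 0.0], [(0.0, 0.5, 60, "piano")]),
    Stem("dataset_b", [0.25, -0.25], [(0.1, 0.4, 40, "bass")]),
    Stem("dataset_b", [0.1, 0.1, 0.1, 0.1], [(0.0, 1.0, 72, "voice")]),
]
audio, notes = cross_stem_example(pool, k=2, rng=random.Random(0))
```

Because the labels travel with each stem, every random mixture comes with a complete annotation for the instruments it contains, which is one plausible reason this kind of augmentation helps when fully annotated recordings are scarce.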

YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation

MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage

Multi-Instrument Polyphonic Melody Transcription Based on Deep Learning

Unaligned Supervision For Automatic Music Transcription in The Wild

End-to-end Piano Performance-MIDI to Score Conversion with Transformers

Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing

Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint Training

Improved Architecture for High-resolution Piano Transcription to Efficiently Capture Acoustic Characteristics of Music Signals

High Resolution Guitar Transcription via Domain Adaptation

Annotation-free Automatic Music Transcription with Scalable Synthetic Data and Adversarial Domain Confusion

Mel-RoFormer for Vocal Separation and Vocal Melody Transcription

Jointist: Joint Learning for Multi-instrument Transcription and Its Applications

LakhNES: Improving multi-instrumental music generation with cross-domain pre-training

Multitrack Music Transcription with a Time-Frequency Perceiver

MFAE: Masked frame-level autoencoder with hybrid-supervision for low-resource music transcription

A Multi-Scale Attentive Transformer for Multi-Instrument Symbolic Music Generation

Towards Musically Informed Evaluation of Piano Transcription Models

Transfer of knowledge among instruments in automatic music transcription

Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription

Machine Learning Techniques in Automatic Music Transcription: A Systematic Survey

Invariances and Data Augmentation for Supervised Music Transcription