YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation

Sungkyun Chang, Emmanouil Benetos, Holger Kirchhoff, Simon Dixon
2024-08-01
Abstract: Multi-instrument music transcription aims to convert polyphonic music recordings into musical scores assigned to each instrument. This task is challenging for modeling as it requires simultaneously identifying multiple instruments and transcribing their pitch and precise timing, and the lack of fully annotated data adds to the training difficulties. This paper introduces YourMT3+, a suite of models for enhanced multi-instrument music transcription based on the recent language token decoding approach of MT3. We enhance its encoder by adopting a hierarchical attention transformer in the time-frequency domain and integrating a mixture of experts. To address data limitations, we introduce a new multi-channel decoding method for training with incomplete annotations and propose intra- and cross-stem augmentation for dataset mixing. Our experiments demonstrate direct vocal transcription capabilities, eliminating the need for voice separation pre-processors. Benchmarks across ten public datasets show our models' competitiveness with, or superiority to, existing transcription models. Further testing on pop music recordings highlights the limitations of current models. Fully reproducible code and datasets are available with demos at https://github.com/mimbres/YourMT3.
Audio and Speech Processing, Machine Learning, Sound
What problem does this paper attempt to address?
This paper addresses four key problems in multi-instrument music transcription:

1. **Simultaneously identifying and transcribing multiple instruments**: Multi-instrument music transcription converts polyphonic music recordings into a musical score for each instrument. The model must identify several instruments at once and transcribe the pitch and precise timing of each, which means handling complex audio signals and accurately separating the sounds of individual instruments.
2. **Insufficient data annotation**: Fully annotated datasets are scarce, which makes training harder. Many existing datasets are only partially annotated or omit certain instruments entirely, so making effective use of incompletely annotated data is a pressing problem.
3. **Transcribing vocals directly, without pre-processing**: The singing voice is a particularly important component of multi-instrument music, yet traditional methods usually separate the voice first and transcribe it afterwards. This paper proposes a method that transcribes vocals directly from the mixed audio, with no additional voice separation step.
4. **Improving the performance and generalization of existing models**: Despite recent progress, existing multi-instrument transcription models remain limited on real-world music. For example, they perform poorly on commercial pop music recordings, especially on non-mainstream instruments, so improving performance on such complex audio is also an important issue.

### Specific solutions

To address these challenges, the authors propose YourMT3+, a suite of improved models based on the Transformer architecture, with the following main innovations:

- **Enhanced encoder**: A hierarchical attention transformer and a mixture of experts (MoE) are adopted to better capture features in the time-frequency domain.
- **Multi-channel decoder**: A multi-channel decoder allows the model to be trained on partially annotated data through task queries, improving its flexibility and robustness.
- **Data augmentation techniques**: Online data augmentation strategies such as cross-dataset stem augmentation and pitch-shifting increase the diversity of training samples and thereby improve generalization.

With these improvements, YourMT3+ is competitive with or outperforms existing transcription models across ten public datasets and can transcribe vocals directly from the mix without prior voice separation. The experiments also show that the models perform well on synthetic datasets but remain limited on commercial pop music recordings, suggesting that future work may need more diverse training data and more refined pitch processing.
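
The idea behind cross-dataset stem augmentation can be illustrated with a small sketch. This is not the authors' implementation: the `Stem` container, the `cross_stem_mix` function, the note tuple layout, and the gain range are all assumptions made for illustration. The sketch only shows the general principle of drawing isolated stems from a pool that may span several datasets, summing their audio, and merging their note labels into one training example.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Stem:
    """One isolated instrument track with its note labels (hypothetical container)."""
    audio: np.ndarray  # mono waveform, shape (num_samples,)
    notes: list        # (onset_sec, offset_sec, midi_pitch, program) tuples
    dataset: str       # name of the source dataset


def cross_stem_mix(stems, num_stems, sample_rate, rng, gain_range=(0.5, 1.0)):
    """Mix randomly chosen stems into one training example.

    Returns the mixed waveform and the merged, onset-sorted note list.
    """
    picked = rng.choice(len(stems), size=num_stems, replace=False)
    chosen = [stems[i] for i in picked]

    # Truncate every stem to the shortest one so the waveforms can be summed.
    length = min(len(s.audio) for s in chosen)
    duration = length / sample_rate

    mix = np.zeros(length, dtype=np.float32)
    notes = []
    for stem in chosen:
        gain = float(rng.uniform(*gain_range))  # random per-stem level (assumed range)
        mix += gain * stem.audio[:length].astype(np.float32)
        for onset, offset, pitch, program in stem.notes:
            if onset < duration:  # drop notes that start after the cut
                notes.append((onset, min(offset, duration), pitch, program))

    # Rescale only if the summed signal would clip.
    peak = float(np.max(np.abs(mix))) if length else 0.0
    if peak > 1.0:
        mix /= peak

    notes.sort(key=lambda n: n[0])
    return mix, notes


# Toy pool: sine tones stand in for real stems from three made-up datasets.
sr = 16000


def tone(freq, seconds):
    t = np.arange(int(seconds * sr)) / sr
    return np.sin(2 * np.pi * freq * t).astype(np.float32)


pool = [
    Stem(tone(440.0, 1.0), [(0.0, 1.0, 69, 0)], "dataset_a"),
    Stem(tone(110.0, 2.0), [(0.25, 0.75, 45, 33), (1.5, 1.9, 45, 33)], "dataset_b"),
    Stem(tone(330.0, 1.5), [(0.5, 1.4, 64, 24)], "dataset_c"),
]

mix, notes = cross_stem_mix(pool, num_stems=2, sample_rate=sr, rng=np.random.default_rng(0))
```

Because the labels travel with each stem, the merged note list stays consistent with the mixed audio regardless of which datasets the stems came from. A real pipeline would additionally need to align sample rates and segment boundaries across datasets, which this sketch assumes away.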