Abstract: In this paper, we introduce Jointist, an instrument-aware multi-instrument framework capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes instrument information and transcription results. Joint training of the transcription and source separation modules improves the performance of both tasks. The instrument module is optional and can be directly controlled by human users, which makes Jointist a flexible, user-controllable framework. Our challenging problem formulation makes the model highly useful in the real world, given that modern popular music typically consists of multiple instruments. Its novelty, however, necessitates a new perspective on how to evaluate such a model. In our experiments, we assess the proposed model from various aspects, providing a new evaluation perspective for multi-instrument transcription. Our subjective listening study shows that Jointist achieves state-of-the-art performance on popular music, outperforming existing multi-instrument transcription models such as MT3. We conducted experiments on several downstream tasks and found that the proposed method improved transcription by more than 1 percentage point (ppt.), source separation by 5 SDR, downbeat detection by 1.8 ppt., chord recognition by 1.4 ppt., and key estimation by 1.4 ppt., when utilizing transcription results obtained from Jointist. Demo: https://jointist.github.io/Demo

Source Separation of Piano Concertos Using Hybrid LSTM-Transformer Model

Source Separation of Piano Concertos Using Musically Motivated Augmentation Techniques

A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis

Source Separation & Automatic Transcription for Music

Piano automatic transcription based on transformer

End-to-end Piano Performance-MIDI to Score Conversion with Transformers

A study of audio mixing methods for piano transcription in violin-piano ensembles

Deep Learning Based Source Separation Applied To Choir Ensembles

HPPNet: Modeling the Harmonic Structure and Pitch Invariance in Piano Transcription

Reconstructing Human Expressiveness in Piano Performances with a Transformer Network

Harmonic Frequency-Separable Transformer for Instrument-Agnostic Music Transcription

Music source separation conditioned on 3D point clouds

Music Source Separation With Band-Split RNN

Music Source Separation in the Waveform Domain

Class-conditional Embeddings for Music Source Separation

Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint Training

Scoring Time Intervals using Non-Hierarchical Transformer For Automatic Piano Transcription

Music Source Separation with Band-Split RoPE Transformer

Hierarchic Temporal Convolutional Network With Cross-Domain Encoder for Music Source Separation

Unsupervised Single-Channel Music Source Separation by Average Harmonic Structure Modeling

SynthSOD: Developing an Heterogeneous Dataset for Orchestra Music Source Separation