Abstract: Extracting pitch information from music recordings is a challenging but important problem in music signal processing. Frame-wise transcription, or multi-pitch estimation, aims to detect the simultaneous activity of pitches in polyphonic music recordings and has recently seen major improvements thanks to deep-learning techniques, with a variety of proposed network architectures. In this paper, we realize different architectures based on CNNs, the U-net structure, and self-attention components. We propose several modifications to these architectures, including self-attention modules for skip connections, recurrent layers to replace the self-attention, and a multi-task strategy with simultaneous prediction of the degree of polyphony. We compare variants of these architectures in different sizes for multi-pitch estimation, focusing on Western classical music beyond the piano-solo scenario using the MusicNet and Schubert Winterreise datasets. Our experiments indicate that most architectures yield competitive results and that larger model variants seem to be beneficial. However, we find that these results depend substantially on randomization effects and on the particular choice of the training-test split, which calls into question claims of superiority for particular architectures when the improvements are only small. We therefore investigate the influence of dataset splits in the presence of several movements of a work cycle (cross-version evaluation) and propose a best-practice splitting strategy for MusicNet, which weakens the influence of individual test tracks and suppresses overfitting to specific works and recording conditions. A final evaluation on a mixed dataset suggests that improvements on one specific dataset do not necessarily generalize to other scenarios, thus emphasizing the need for further high-quality multi-pitch datasets in order to reliably measure progress in music transcription tasks.

Evaluation of CNN-based Automatic Music Tagging Models

An Experimental Comparison Of Multi-view Self-supervised Methods For Music Tagging

Audio-Based Music Classification with DenseNet And Data Augmentation

Music Genre Classification Based on Res-Gated CNN and Attention Mechanism

A Hybrid Parallel Computing Architecture Based on CNN and Transformer for Music Genre Classification

MuSLCAT: Multi-Scale Multi-Level Convolutional Attention Transformer for Discriminative Music Modeling on Raw Waveforms

Event Localization in Music Auto-tagging

Music Auto-Tagging with Robust Music Representation Learned via Domain Adversarial Training

musicnn: Pre-trained convolutional neural networks for music audio tagging

Evaluation of pretrained language models on music understanding

Evaluation System of Music Art Instructional Quality Based on Convolutional Neural Networks and Big Data Analysis

Meta learning based audio tagging

Bottom-up broadcast neural network for music genre classification

Perceptual Musical Features for Interpretable Audio Tagging

Deep-Learning Architectures for Multi-Pitch Estimation: Towards Reliable Evaluation

Automatic Audio Chord Recognition with MIDI-Trained Deep Feature and BLSTM-CRF Sequence Decoding Model

Audio Tagging with Compact Feedforward Sequential Memory Network and Audio-to-Audio Ratio Based Data Augmentation

MusiCoder: A Universal Music-Acoustic Encoder Based on Transformers

Subjective Evaluation of Deep Learning Models for Symbolic Music Composition

Music Art Teaching Quality Evaluation System Based on Convolutional Neural Network

Hierarchical Attentive Deep Neural Networks for Semantic Music Annotation Through Multiple Music Representations