Abstract: Sheet music recognition converts printed or handwritten musical scores into digital, machine-readable formats. The technology makes compositions easier to edit, perform, learn, and share, thereby supporting music education, composition, and culture, and it provides a powerful tool for music analysis, research, and preservation. Our aim is to investigate a sheet music recognition method that offers a simple workflow, high recognition accuracy, and fast model convergence. The proposed Deep Multilevel Cascade Residual Recurrent (MCRR) framework for sheet music recognition consists of the following components. First, we introduce additive Gaussian white noise, additive Perlin noise, and elastic deformations such as rotation and stretching to simulate real-world noise in the sheet music images, thereby augmenting the dataset, enhancing model robustness, and mitigating overfitting. Second, in the feature extraction phase, we employ a residual Convolutional Neural Network (ConvNet) to address model degradation and use a multilevel cascade fusion technique to obtain comprehensive feature information, improving the model’s feature extraction capability and reducing recognition errors. Third, for note recognition, we use the Simple Recurrent Unit (SRU), a variant of the Recurrent Neural Network (RNN) that performs most of its computations in parallel, speeding up model convergence. Finally, we combine the Connectionist Temporal Classification (CTC) loss function with the SRU to eliminate the requirement for strict alignment between data and labels, enabling note classification and recognition.
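To illustrate why CTC removes the need for strict alignment between image frames and labels, the following minimal sketch shows greedy (best-path) CTC decoding: per-frame predictions are merged when repeated, and blank labels are dropped. It is not the authors' implementation. The blank index, the vocabulary, and the frame scores are all hypothetical values chosen for illustration.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is reserved for the CTC blank label


def ctc_greedy_decode(frame_scores, vocabulary):
    """Collapse per-frame predictions into a label sequence (best-path decoding)."""
    # Pick the highest-scoring class index for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # Merge runs of identical indices, then drop the blanks.
    collapsed = [index for index, _ in groupby(best_path)]
    return [vocabulary[index] for index in collapsed if index != BLANK]


# Hypothetical vocabulary; index 0 is the blank.
VOCAB = ["<blank>", "note-C4_quarter", "note-D4_quarter", "barline"]

# Six frames of made-up scores; no frame-level labels are needed.
frames = [
    [0.10, 0.80, 0.05, 0.05],  # C4
    [0.10, 0.70, 0.10, 0.10],  # C4 again (merged with the previous frame)
    [0.90, 0.05, 0.03, 0.02],  # blank
    [0.20, 0.10, 0.60, 0.10],  # D4
    [0.70, 0.10, 0.10, 0.10],  # blank
    [0.10, 0.10, 0.10, 0.70],  # barline
]

decoded = ctc_greedy_decode(frames, VOCAB)
print(decoded)  # ['note-C4_quarter', 'note-D4_quarter', 'barline']
```

Because a blank between two identical predictions keeps them separate, the same symbol can still occur twice in a row in the decoded output.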
Extensive ablation experiments and comparative analyses, including visual analysis, intuitive illustrations, and quantitative assessments, confirm the effectiveness of the proposed method and demonstrate its superiority over various state-of-the-art methods. The proposed method achieved promising results on both the PrIMuS and Camera-PrIMuS datasets. On the PrIMuS dataset, it obtained an SeER (Symbol Error Rate) of 1.4571% and a SyER (System Error Rate) of 0.3234%, with pitch and type accuracies of approximately 97% and a note accuracy of around 94%. The training time per epoch was relatively low, at 0.56 seconds. On the Camera-PrIMuS dataset, the method achieved slightly lower but still competitive results: an SeER of 5.1488% and a SyER of 1.0612%, with pitch and type accuracies of around 90% and a note accuracy of approximately 88%. The training time per epoch was slightly higher, at 1.93 seconds. Furthermore, we compare our method with existing commercial software, namely Capella-scan, PhotoScore, and SmartScore. Among these, Capella-scan delivers the best performance but exhibits lower robustness than the proposed method.
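The abstract reports error rates without stating how they are computed. One common way to score a predicted symbol sequence against its reference is the edit distance normalized by the total reference length. The sketch below assumes that definition, which may differ from the authors' exact metrics. The function names and example sequences are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_symbol != hyp_symbol)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def error_rate_percent(references, hypotheses):
    """Total edits over total reference symbols, as a percentage (assumed definition)."""
    total_symbols = sum(len(reference) for reference in references)
    if total_symbols == 0:
        raise ValueError("references must contain at least one symbol")
    total_edits = sum(
        edit_distance(reference, hypothesis)
        for reference, hypothesis in zip(references, hypotheses)
    )
    return 100.0 * total_edits / total_symbols


# Hypothetical example: one staff with one wrongly recognized pitch.
references = [["clef-G2", "note-C4_quarter", "note-D4_quarter", "barline"]]
hypotheses = [["clef-G2", "note-C4_quarter", "note-E4_quarter", "barline"]]

rate = error_rate_percent(references, hypotheses)
print(rate)  # 25.0
```

One substitution among four reference symbols gives 25%, so under this assumed definition a rate near 1% corresponds to roughly one wrong symbol per hundred.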
