Abstract: Sheet music recognition converts printed or handwritten musical scores into digital, machine-readable formats. It makes compositions easier to edit, perform, learn from, and share, which supports music education, composition, and culture, and it provides a powerful tool for music analysis, research, and preservation. Our aim is to investigate a sheet music recognition method that offers a simple workflow, high recognition accuracy, and fast model convergence. The proposed Deep Multilevel Cascade Residual Recurrent (MCRR) framework for sheet music recognition consists of the following components. First, we add Gaussian white noise, Perlin noise, and elastic deformations such as rotation and stretching to the sheet music images to simulate real-world noise, thereby augmenting the dataset, enhancing model robustness, and mitigating overfitting. Second, in the feature extraction phase, we employ a residual Convolutional Neural Network (ConvNet) to address model degradation and use a multilevel cascade fusion technique to obtain comprehensive feature information, improving the model’s feature extraction capability and reducing recognition errors. Third, for note recognition, we use the Simple Recurrent Unit (SRU), a variant of the Recurrent Neural Network (RNN) that performs most of its computation in parallel, which speeds up model convergence. Finally, we combine the Connectionist Temporal Classification (CTC) loss function with the SRU to remove the requirement for strict alignment between data and labels, enabling note classification and recognition.
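To illustrate why CTC removes the need for strict alignment, the following minimal sketch shows the standard CTC best-path collapsing rule: consecutive repeated frame labels are merged, then blank labels are dropped. It is not the authors' implementation; the blank index `0`, the function name `ctc_greedy_decode`, and the example label values are assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label (illustrative choice)


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse per-frame labels into an output symbol sequence.

    Step 1: merge runs of identical consecutive labels.
    Step 2: remove the blank label.
    """
    collapsed = [label for label, _ in groupby(frame_labels)]
    return [label for label in collapsed if label != blank]


# Nine frames map to three symbols; the blank between the two 3s
# is what allows the same symbol to be emitted twice in a row.
frames = [0, 3, 3, 0, 3, 5, 5, 0, 0]
print(ctc_greedy_decode(frames))  # [3, 3, 5]
```

Because many different frame-level labelings collapse to the same symbol sequence, training only needs the target sequence, not a per-frame annotation.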
Extensive ablation experiments and comparative analyses, including visual analysis, intuitive illustrations, and quantitative assessments, confirm the effectiveness of the proposed method and demonstrate its superiority over various state-of-the-art methods. The proposed method achieved promising results on both the PrIMuS and Camera-PrIMuS datasets. On the PrIMuS dataset, it obtained an SeER (Symbol Error Rate) of 1.4571% and a SyER (System Error Rate) of 0.3234%, with pitch and type accuracies of approximately 97% and note accuracy of around 94%. The training time per epoch was relatively low at 0.56 seconds. On the Camera-PrIMuS dataset, the method achieved slightly lower but still competitive results: an SeER of 5.1488% and a SyER of 1.0612%, with pitch and type accuracies of around 90% and note accuracy of approximately 88%. The training time per epoch was slightly higher at 1.93 seconds. Furthermore, we compare our method with existing commercial software, namely Capella-scan, PhotoScore, and SmartScore. Among these, Capella-scan delivers the best performance but is less robust than the proposed method.
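The abstract does not state how its error rates are computed. As a hedged sketch, error rates of this kind are commonly defined as the edit distance between predicted and reference symbol sequences, normalized by the total reference length. The function names and the example tokens below are assumptions for illustration, not the paper's evaluation code.

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def symbol_error_rate(refs, hyps):
    """Percentage of edits over all reference symbols (assumed definition)."""
    total_symbols = sum(len(r) for r in refs)
    if total_symbols == 0:
        raise ValueError("reference set contains no symbols")
    total_edits = sum(levenshtein(r, h) for r, h in zip(refs, hyps))
    return 100.0 * total_edits / total_symbols


# Illustrative token names; one substituted note out of four symbols.
refs = [["clef-G2", "note-C4_quarter", "note-D4_quarter", "barline"]]
hyps = [["clef-G2", "note-C4_quarter", "note-E4_quarter", "barline"]]
print(symbol_error_rate(refs, hyps))  # 25.0
```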

Deep Multilevel Cascade Residual Recurrent Framework (MCRR) for Sheet Music Recognition
