Abstract: Sheet music recognition converts printed or handwritten musical scores into digital, machine-readable formats. This makes compositions easier to edit, perform, learn from, and share, which supports music education, composition, and culture, and it provides a tool for music analysis, research, and preservation. Our aim is to investigate a sheet music recognition method that offers a simple workflow, high recognition accuracy, and fast model convergence. The proposed Deep Multilevel Cascade Residual Recurrent (MCRR) framework consists of the following components. First, we add Gaussian white noise and Perlin noise and apply elastic deformations such as rotation and stretching to simulate real-world noise in sheet music images, thereby augmenting the dataset, enhancing model robustness, and mitigating overfitting. Second, in the feature extraction phase, we employ a residual Convolutional Neural Network (ConvNet) to address model degradation and use a multilevel cascade fusion technique to obtain comprehensive feature information, improving the model's feature extraction capability and reducing recognition errors. Third, for note recognition, we use the Simple Recurrent Unit (SRU), a variant of the Recurrent Neural Network (RNN) that parallelizes most of its computation, which speeds up model convergence. Finally, we combine the Connectionist Temporal Classification (CTC) loss function with the SRU to eliminate the requirement for strict alignment between data and labels, enabling note classification and recognition.
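To illustrate why CTC removes the need for frame-level alignment, the following is a minimal sketch of greedy CTC decoding: pick the best class per frame, merge consecutive repeats, then drop the blank label. It is not the authors' implementation. The blank index, the toy three-class vocabulary, and the function name `ctc_greedy_decode` are assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label (hypothetical choice)


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Collapse per-frame class scores into a label sequence (greedy CTC)."""
    # Pick the best-scoring class index for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Merge consecutive repeats, then drop blanks.
    collapsed = [label for label, _ in groupby(best_path)]
    return [label for label in collapsed if label != blank]


# Six frames over a toy vocabulary: 0 = blank, 1 = a note symbol, 2 = a barline.
frames = [
    [0.10, 0.80, 0.10],  # best class: 1
    [0.20, 0.70, 0.10],  # best class: 1 (repeat, merged with the previous frame)
    [0.90, 0.05, 0.05],  # best class: 0 (blank separates two identical symbols)
    [0.10, 0.80, 0.10],  # best class: 1
    [0.10, 0.10, 0.80],  # best class: 2
    [0.10, 0.10, 0.80],  # best class: 2 (repeat, merged)
]
decoded = ctc_greedy_decode(frames)
print(decoded)  # [1, 1, 2]
```

Six frames collapse to three symbols, so the label sequence never has to state which frame each symbol occupies. The blank is what allows the same symbol to appear twice in a row.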
Extensive ablation experiments and comparative analyses, including visual analysis, intuitive illustrations, and quantitative assessments, confirm the effectiveness of the proposed method and demonstrate its superiority over various state-of-the-art methods. The method achieved promising results on both the PrIMuS and Camera-PrIMuS datasets. On PrIMuS, it obtained an SeER (Symbol Error Rate) of 1.4571% and a SyER (System Error Rate) of 0.3234%, with approximately 97% pitch and type accuracy and around 94% note accuracy; the training time per epoch was a relatively low 0.56 seconds. On Camera-PrIMuS, the results were slightly lower but still competitive: an SeER of 5.1488% and a SyER of 1.0612%, with pitch and type accuracies around 90%, note accuracy of approximately 88%, and a slightly higher training time per epoch of 1.93 seconds. Furthermore, we compare our method with existing commercial software, namely Capella-scan, PhotoScore, and SmartScore. Among these, Capella-scan delivers the best performance but exhibits lower robustness than the proposed method.
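Error rates of the kind reported above are commonly computed from the edit distance between predicted and reference symbol sequences. The sketch below assumes two conventional definitions: a symbol-level rate (total edit operations divided by total reference symbols) and a sequence-level rate (the share of sequences with at least one error). The abstract does not state the exact formulas, so how these map onto SeER and SyER is left open. The function names and symbol strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def symbol_level_error_rate(refs, hyps):
    """Percentage of edit operations relative to the number of reference symbols."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_symbols = sum(len(r) for r in refs)
    return 100.0 * total_edits / total_symbols


def sequence_level_error_rate(refs, hyps):
    """Percentage of sequences containing at least one error."""
    wrong = sum(1 for r, h in zip(refs, hyps) if list(r) != list(h))
    return 100.0 * wrong / len(refs)


# Toy data: two staves, the second with one substituted note (hypothetical symbols).
refs = [
    ["clef-G2", "note-C4_quarter", "barline"],
    ["clef-G2", "note-E4_half", "note-G4_half", "barline"],
]
hyps = [
    ["clef-G2", "note-C4_quarter", "barline"],
    ["clef-G2", "note-E4_half", "note-A4_half", "barline"],
]
symbol_rate = symbol_level_error_rate(refs, hyps)      # 1 edit over 7 symbols
sequence_rate = sequence_level_error_rate(refs, hyps)  # 1 of 2 sequences wrong
print(f"{symbol_rate:.2f}% symbol-level, {sequence_rate:.2f}% sequence-level")
```

A single wrong symbol counts against the whole sequence in the sequence-level rate, so that rate is usually the higher of the two on the same predictions.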

Automatic Audio Chord Recognition with MIDI-Trained Deep Feature and BLSTM-CRF Sequence Decoding Model.

Music Chord Recognition Based on Midi-Trained Deep Feature and BLSTM-CRF Hybird Decoding

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM.

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network.

Polyphonic pitch detection with convolutional recurrent neural networks

N-Gram Unsupervised Compoundation and Feature Injection for Better Symbolic Music Understanding

Large-vocabulary Chord Transcription Via Chord Structure Decomposition

Semi-supervised Neural Chord Estimation Based on a Variational Autoencoder with Latent Chord Labels and Features

Construction of Music Intelligent Creation Model Based on Convolutional Neural Network

Feature Learning for Chord Recognition: The Deep Chroma Extractor

DAFE-MSGAT: Dual-Attention Feature Extraction and Multi-Scale Graph Attention Network for Polyphonic Piano Transcription

Construction of AI Environmental Music Education Application Model Based on Deep Learning

RL-Chord: CLSTM-Based Melody Harmonization Using Deep Reinforcement Learning

Note Detection in Music Teaching Based on Intelligent Bidirectional Recurrent Neural Network

Audio-Based Music Classification with DenseNet And Data Augmentation

Improved Architecture for High-resolution Piano Transcription to Efficiently Capture Acoustic Characteristics of Music Signals

Striking a New Chord: Neural Networks in Music Information Dynamics

Generative Adversarial Network for Musical Notation Recognition during Music Teaching

A neural harmonic-aware network with gated attentive fusion for singing melody extraction

Deep Multilevel Cascade Residual Recurrent Framework (MCRR) for Sheet Music Recognition

Hierarchical Attentive Deep Neural Networks for Semantic Music Annotation Through Multiple Music Representations