Coordinate Embedding Transformer Model for Optical Music Recognition on Monophonic Scores

Changhai Zha, Zhiyong Huang, Shiqiang Zhu, Jiakai Zhu, Ling Zhong, Ruilong Du, Jason Gu
DOI: https://doi.org/10.1109/cyber55403.2022.9907643
2022-01-01
Abstract: Optical Music Recognition (OMR) is an image recognition task in which researchers try to teach computers to read music notation. In recent years, convolutional recurrent neural networks have achieved great success in music symbol recognition, especially on monophonic scores. However, some challenges remain. Notes at different positions on the staff share the same image features but represent different meanings, so they are hard to distinguish using convolution alone. In addition, contextual relationships are usually used to improve the overall accuracy of music symbol recognition. In this paper, we propose a Coordinate Embedding Transformer model (CETr). We add pixel coordinates to the feature patches so that note positions on the staff take part in training and prediction, which increases the difference between two notes at different staff positions. Because the Transformer, which is designed for sequence modeling and transduction tasks, handles contextual relationships in a music score reliably, we leverage the Transformer architecture for symbol-level score generation. Experiments show that the CETr model outperforms the current state-of-the-art models on both clean and distorted monophonic scores.
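The coordinate idea described in the abstract can be illustrated with a small sketch. The abstract does not specify how CETr encodes or injects coordinates, so everything here is an assumption made for illustration: the function name `add_coordinates`, the normalization to [0, 1], and the choice to concatenate the coordinates rather than add a learned embedding. It shows only that two feature vectors with identical values become distinguishable once their staff position is attached.

```python
def add_coordinates(feature_map):
    """Append normalized (y, x) coordinates to every feature vector.

    feature_map: list of H rows, each a list of W feature vectors
    (lists of floats). Returns a flat, row-major sequence of H*W vectors,
    each extended by two coordinate entries.
    NOTE: hypothetical helper, not the authors' implementation.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    sequence = []
    for row in range(height):
        # Normalize to [0, 1]; a single-row map gets coordinate 0.0.
        y = row / (height - 1) if height > 1 else 0.0
        for col in range(width):
            x = col / (width - 1) if width > 1 else 0.0
            sequence.append(list(feature_map[row][col]) + [y, x])
    return sequence


# Toy 3x2 feature map: the vectors at rows 0 and 2 of column 0 are
# identical, mimicking the same note head drawn at two staff heights.
features = [
    [[0.5, 0.5], [0.1, 0.2]],
    [[0.9, 0.3], [0.4, 0.4]],
    [[0.5, 0.5], [0.7, 0.6]],
]
tokens = add_coordinates(features)
```

In this sketch the resulting `tokens` list is the kind of sequence a Transformer encoder could consume; the two formerly identical vectors now differ in their vertical coordinate entry.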