Abstract: In this paper, a residual convolutional neural network is used to extract note features from music score images, which addresses the problem of model degradation; multiscale feature fusion is then used to fuse feature information from different levels in the same feature map, enhancing the model's feature representation ability. A network composed of a bidirectional simple recurrent unit (SRU) and a connectionist temporal classification (CTC) function is used to recognize notes. It parallelizes a large share of the computation, which speeds up training convergence, and it removes the need for the data to be strictly aligned with the labels, which reduces the requirements on the dataset. To address the problem that existing cross-modal retrieval methods based on a common subspace insufficiently mine local consistency within modalities, a cross-modal retrieval method fused with graph convolution is proposed. The K-nearest neighbor algorithm is used to construct modal graphs for samples of different modalities; the original features of these samples are encoded through a symmetric graph convolutional coding network and a symmetric multilayer fully connected coding network, and the encoded features are fused and fed into the common subspace. We jointly optimize the intramodal semantic constraints and the intermodal modality-invariant constraints in the common subspace to learn common representations that are both locally and semantically consistent for samples from different modalities. The experimental error values illustrate how parameters such as the number of iterations and the number of neurons affect the network. To show more precisely how closely the generated music sequence matches the original, the generated sequence is also split into frames and spectrograms of the music sequences are produced. The accuracy of the experiment is illustrated by comparing these spectrograms, and genre classification is also performed on the generated music to show that the network can generate music of different genres.
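
The graph-construction and graph-convolution steps summarized above can be illustrated with a minimal sketch. The abstract does not state the distance metric, the value of K, or the normalization used, so this example assumes Euclidean distance, K = 2, and the widely used symmetric normalization ReLU(D^-1/2 (A + I) D^-1/2 X W). The function names `knn_graph` and `graph_conv` and the toy data are hypothetical and not taken from the paper.

```python
import math

def knn_graph(features, k):
    """Build an undirected k-nearest-neighbor adjacency matrix (assumed Euclidean)."""
    n = len(features)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Rank every other sample by its distance to sample i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(features[i], features[j]))
        for j in others[:k]:
            adj[i][j] = adj[j][i] = 1.0  # symmetrize so the graph is undirected
    return adj

def graph_conv(adj, features, weight):
    """One graph convolution layer: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own feature.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    in_dim, out_dim = len(weight), len(weight[0])
    out = []
    for i in range(n):
        # Normalized aggregation of neighbor features.
        agg = [sum(a_hat[i][j] / math.sqrt(deg[i] * deg[j]) * features[j][d]
                   for j in range(n))
               for d in range(in_dim)]
        # Linear projection followed by ReLU.
        out.append([max(0.0, sum(agg[d] * weight[d][c] for d in range(in_dim)))
                    for c in range(out_dim)])
    return out

# Toy data (hypothetical): two well-separated clusters of three samples each.
features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
            [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity projection, for readability only

adj = knn_graph(features, k=2)
encoded = graph_conv(adj, features, weight)
```

In this toy setting each sample is linked only to the two other members of its own cluster, so after one layer all samples in a cluster share the same encoding. This is the kind of local consistency within a modality that the method aims to preserve; a real encoder would learn `weight` rather than fix it.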
