Stereo Feature Enhancement and Temporal Information Extraction Network for Automatic Music Transcription

Wen Zhang, Yonghui Zhang, Yanjun She, Jie Shao
DOI: https://doi.org/10.1109/lsp.2021.3099073
2021-01-01
IEEE Signal Processing Letters
Abstract: Automatic music transcription (AMT), which aims to convert raw audio into a symbolic representation, is a challenging audio processing task that has attracted increasing attention in recent years. Music recordings today are usually stereo audio files, yet many previous studies simply average the stereo signal into a mono signal when processing data, which sacrifices some useful information. In this paper, we design a stereo feature enhancement (SFE) module based on the self-attention mechanism to make full use of stereo information. Moreover, temporal convolutional networks (TCNs) have recently proven effective for processing temporal data, overcoming some drawbacks of existing temporal information extraction methods such as HMMs, RNNs, and LSTMs. Inspired by this, we propose a temporal convolutional module (TCM) suited to extracting the temporal context of music. The proposed network is validated on the MAPS dataset for music transcription and achieves strong performance.
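To illustrate why averaging stereo into mono discards information, the following minimal Python sketch splits a toy stereo signal into a mid channel (what the average keeps) and a side channel (what it drops). It is not the paper's SFE module; the function names and the two-tone test signal are assumptions made purely for illustration.

```python
import math

def stereo_to_mono(left, right):
    """Average the two channels sample by sample (the common mono downmix)."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]

def mid_side(left, right):
    """Split stereo into mid (kept by averaging) and side (discarded by averaging)."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side

# Hypothetical toy signal: a 440 Hz tone in the left channel only
# and a 660 Hz tone in the right channel only.
sample_rate = 8000
n = 64
left = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(n)]
right = [math.sin(2 * math.pi * 660 * t / sample_rate) for t in range(n)]

mono = stereo_to_mono(left, right)
mid, side = mid_side(left, right)

# Energy of the side channel, i.e. of the part the mono downmix throws away.
side_energy = sum(s * s for s in side)

# Swapping the channels yields the same mono signal, so the downmix
# cannot tell which source was on which side.
mono_swapped = stereo_to_mono(right, left)

# Mid and side together recover the original channels: L = mid + side, R = mid - side.
left_rebuilt = [m + s for m, s in zip(mid, side)]
right_rebuilt = [m - s for m, s in zip(mid, side)]
```

Here the mono signal is identical to the mid channel, while the side channel carries nonzero energy that a mono pipeline never sees. A model that receives both channels can, in principle, exploit this inter-channel difference.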