Label-Specific Time-Frequency Energy-Based Neural Network for Instrument Recognition

Jian Zhang, Tong Wei, Min-Ling Zhang
DOI: https://doi.org/10.1109/TCYB.2024.3433519
2024-08-19
Abstract: Predominant instrument recognition plays a vital role in music information retrieval. The task involves identifying and categorizing the dominant instruments in a piece of music based on their distinctive time-frequency characteristics and harmonic distributions. Existing approaches mainly learn implicit mappings (such as deep neural networks) from time-domain or frequency-domain representations of music audio to instrument labels. However, different instruments playing together in polyphonic music produce locally superposed time-frequency representations, and most implicit models can be sensitive to such local changes in the data. This makes it difficult for implicit methods to accurately capture the unique harmonic features of each instrument. To address this challenge, and considering that the complete harmonic information of an instrument is distributed across a wide range of frequencies, we design a label-specific time-frequency feature learning approach that converts the task of building implicit classification mappings into a process of extracting and matching features specific to each instrument. The result is a new explicit learning model, the label-specific time-frequency energy-based neural network (LSTN). Unlike existing implicit models, LSTN not only extracts the local time-frequency features those models commonly use but also incorporates time-domain and frequency-domain factors in its energy function to explicitly parameterize long-term correlation and long-frequency correlation features. Using the extracted time-frequency features and the two long-correlation features as instrument label-specific features, LSTN detects whether the harmonic distribution of each instrument appears in polyphonic music on both long and local time-frequency scales, mitigating the challenges posed by locally superposed representations.
We analyze the complexity and convergence of LSTN, and experiments on benchmark datasets demonstrate its superiority over other established instrument recognition algorithms.
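The local superposition problem mentioned in the abstract can be illustrated with a toy example. The sketch below is not the LSTN model. It only assumes two synthetic harmonic tones (fundamentals of 200 Hz and 300 Hz, five harmonics each, with 1/h amplitudes, all chosen purely for illustration) and shows that, because the Fourier transform is linear, the spectrum of the mixture is the sum of the individual spectra, so a harmonic bin shared by both sources carries energy from both.

```python
import numpy as np

SR = 8000  # sample rate in Hz (assumed for this toy example)
N = 8000   # one second of audio, so DFT bin k corresponds to k Hz

def harmonic_tone(f0, n_harmonics=5):
    """Sum of sinusoids at integer multiples of f0 with 1/h amplitudes."""
    t = np.arange(N) / SR
    return sum(np.sin(2 * np.pi * f0 * h * t) / h
               for h in range(1, n_harmonics + 1))

def magnitude(signal):
    """Magnitude spectrum scaled so a unit-amplitude sinusoid reads as 1.0."""
    return np.abs(np.fft.rfft(signal)) / (N / 2)

def peak_bins(signal, threshold=0.1):
    """Set of frequency bins (in Hz) whose scaled magnitude exceeds threshold."""
    return set(np.flatnonzero(magnitude(signal) > threshold).tolist())

a = harmonic_tone(200)  # harmonics at 200, 400, 600, 800, 1000 Hz
b = harmonic_tone(300)  # harmonics at 300, 600, 900, 1200, 1500 Hz
mix = a + b             # hypothetical two-source "polyphonic" signal

# Bins where both sources have a harmonic: their energy is superposed in the mix.
shared = peak_bins(a) & peak_bins(b)

# Linearity: the spectrum of the mixture equals the sum of the source spectra.
is_linear = np.allclose(np.fft.rfft(mix), np.fft.rfft(a) + np.fft.rfft(b))

print(sorted(shared), is_linear)
```

In this toy setting the only shared bin is 600 Hz (the third harmonic of the 200 Hz tone and the second harmonic of the 300 Hz tone). A model that inspects only that local region cannot tell from magnitude alone how the energy divides between the two sources, which is one way to read the abstract's motivation for also using features that span long time and frequency ranges.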