Large-Vocabulary Chord Transcription via Chord Structure Decomposition
Junyan Jiang, Ke Chen, Wei Li, Gus Xia
DOI: https://doi.org/10.5281/zenodo.3527892
2019-01-01
Abstract: While audio chord recognition systems achieve considerable accuracy on small vocabularies (e.g., major/minor chords), large-vocabulary chord recognition remains unsolved, which hinders the practical use of audio recognition systems. The difficulty lies mainly in the intrinsic long-tail distribution of chord qualities: most chord qualities have too few samples for model training. In this paper, we propose a new model for audio chord recognition under a very large chord vocabulary. The core idea is to decompose any chord label into a set of musically meaningful components (e.g., triad, bass, seventh), each with a much smaller vocabulary than the overall chord vocabulary. A multitask classifier is trained to recognize all the components from the audio features, and the labels of the individual components are then reassembled to form the final chord label. Experiments show that the proposed system not only achieves state-of-the-art results on traditional evaluation metrics but also performs well on a large vocabulary.

Large-vocabulary chord transcription is a difficult task because the number of chord qualities is large and the distribution of chord classes in the training data is extremely skewed. For example, the Billboard dataset [2], a human-annotated dataset, contains 230 different chord qualities, or equivalently 2,749 distinct chord classes (assuming that each chord quality can be combined with all 12 possible roots, except for the N chord). While the most frequent 10% of chord qualities cover 93.86% of the data, the least frequent 50% cover only 0.35% of the data altogether (chord quality counts are weighted by their durations). Such a long-tailed chord distribution makes it extremely hard to model rare chord qualities. To bypass the problem, previous systems typically adopt two kinds of strategies: chord quality simplification and

© Junyan Jiang, et al.
Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Junyan Jiang, et al. “Large-Vocabulary Chord Transcription via Chord Structure Decomposition”, 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, 2019.
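
The decomposition idea from the abstract, and the class count quoted for the Billboard dataset, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the label syntax (`root:quality/bass`), the tiny quality table, and the names `decompose`, `reassemble`, and `count_chord_classes` are all assumptions made for illustration. The count also assumes that the 230 qualities include the no-chord symbol N, which is the reading that reproduces the figure of 2,749.

```python
from typing import NamedTuple

# Hypothetical mapping from a chord quality to (triad, seventh) components.
# The paper's actual component vocabularies may differ.
QUALITY_TO_COMPONENTS = {
    "maj": ("maj", "none"),
    "min": ("min", "none"),
    "7": ("maj", "b7"),
    "maj7": ("maj", "7"),
    "min7": ("min", "b7"),
}
COMPONENTS_TO_QUALITY = {v: k for k, v in QUALITY_TO_COMPONENTS.items()}


class ChordComponents(NamedTuple):
    root: str
    triad: str
    seventh: str
    bass: str  # bass as a scale degree relative to the root; "1" = root position


def decompose(label: str) -> ChordComponents:
    """Split a label such as 'C:maj7/3' into small-vocabulary components."""
    chord, _, bass = label.partition("/")
    root, _, quality = chord.partition(":")
    triad, seventh = QUALITY_TO_COMPONENTS[quality]
    return ChordComponents(root, triad, seventh, bass or "1")


def reassemble(c: ChordComponents) -> str:
    """Rebuild the chord label from its (separately predicted) components."""
    quality = COMPONENTS_TO_QUALITY[(c.triad, c.seventh)]
    label = f"{c.root}:{quality}"
    return label if c.bass == "1" else f"{label}/{c.bass}"


def count_chord_classes(num_qualities: int, num_roots: int = 12) -> int:
    """Every quality except N combines with every root; N counts once."""
    return (num_qualities - 1) * num_roots + 1


print(reassemble(decompose("C:maj7/3")))  # C:maj7/3
print(count_chord_classes(230))           # 2749
```

In this sketch each component has only a handful of possible values, so a classifier head per component faces a far smaller and better-populated label space than a single head over all chord classes.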