Abstract: While audio chord recognition systems have achieved considerable accuracy on small vocabularies (e.g., major/minor chords), large-vocabulary chord recognition remains unsolved, which hinders the practical use of audio recognition systems. The difficulty lies mainly in the intrinsic long-tail distribution of chord qualities: most chord qualities have too few samples for model training. In this paper, we propose a new model for audio chord recognition under a huge chord vocabulary. The core concept is to decompose any chord label into a set of musically meaningful components (e.g., triad, bass, seventh), each with a much smaller vocabulary than the overall chord vocabulary. A multitask classifier is trained to recognize all the components from the audio features, and the labels of the individual components are then reassembled to form the final chord label. Experiments show that the proposed system not only achieves state-of-the-art results on traditional evaluation metrics but also performs well on a large vocabulary.

Large-vocabulary chord transcription is a difficult task, as the number of chord qualities is large and the distribution of chord classes in the training data is extremely biased. For example, the Billboard dataset [2], a human-annotated dataset, contains 230 different chord qualities, or equivalently, 2,749 distinct chord classes¹. While the first 10% of chord qualities cover 93.86% of the data, the last 50% of chord qualities cover only 0.35% of the data altogether². Such a long-tailed chord distribution makes it extremely hard to model rare chord qualities. To bypass the problem, former systems typically adopt two kinds of strategies: chord quality simplification and

¹ We assume here that each chord quality can be combined with all 12 possible roots, except for the N chord.
² In this calculation, the chord quality counts are weighted by their durations.

© Junyan Jiang, et al. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Junyan Jiang, et al. “Large-Vocabulary Chord Transcription via Chord Structure Decomposition”, 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, 2019.
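The decomposition idea in the abstract can be illustrated with a minimal Python sketch. It assumes chord labels written as `root:quality/bass` (e.g., `C:min7/b3`) and uses a small hypothetical quality table; the component names (`triad`, `seventh`, `bass`), the table entries, and the helpers `decompose` and `reassemble` are illustrative assumptions, not the paper's actual vocabulary or implementation.

```python
from typing import NamedTuple

# Hypothetical quality table: quality name -> (triad, seventh).
# Only a handful of qualities are listed for illustration.
QUALITIES = {
    "maj": ("maj", "none"),
    "min": ("min", "none"),
    "7": ("maj", "b7"),
    "maj7": ("maj", "7"),
    "min7": ("min", "b7"),
    "dim": ("dim", "none"),
    "dim7": ("dim", "bb7"),
    "hdim7": ("dim", "b7"),
}
# Inverse table, used to reassemble a quality from its components.
COMPONENTS_TO_QUALITY = {v: k for k, v in QUALITIES.items()}


class ChordComponents(NamedTuple):
    root: str
    triad: str
    seventh: str
    bass: str  # interval of the bass above the root; "1" means root position


def decompose(label: str) -> ChordComponents:
    """Split a label such as 'C:min7/b3' into small-vocabulary components."""
    if label == "N":  # the no-chord label has no root or quality
        return ChordComponents("N", "N", "N", "N")
    chord, _, bass = label.partition("/")
    root, _, quality = chord.partition(":")
    triad, seventh = QUALITIES[quality or "maj"]  # a bare root is read as major
    return ChordComponents(root, triad, seventh, bass or "1")


def reassemble(c: ChordComponents) -> str:
    """Rebuild a chord label from component labels.

    Raises KeyError for a (triad, seventh) pair missing from the toy table.
    """
    if c.root == "N":
        return "N"
    quality = COMPONENTS_TO_QUALITY[(c.triad, c.seventh)]
    label = f"{c.root}:{quality}"
    return label if c.bass == "1" else f"{label}/{c.bass}"
```

In this sketch each component has only a few possible values, so a classifier head per component would face a far smaller label set than a single classifier over every full chord label. A rare chord such as `C:min7/b3` shares its triad, seventh, and bass labels with much more common chords.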
