MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization

Haina Zhu, Yizhi Zhou, Hangting Chen, Jianwei Yu, Ziyang Ma, Rongzhi Gu, Wei Tan, Xie Chen
2025-01-02
Abstract: Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation learning model for music understanding. Distinguished from previous studies adopting random projection or existing neural codec, the proposed model, named MuQ, is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ). Our Mel-RVQ utilizes residual linear projection structure for Mel spectrum quantization to enhance the stability and efficiency of target extraction and lead to better performance. Experiments in a large variety of downstream tasks demonstrate that MuQ outperforms previous self-supervised music representation models with only 0.9K hours of open-source pre-training data. Scaling up the data to over 160K hours and adopting iterative training consistently improve the model performance. To further validate the strength of our model, we present MuQ-MuLan, a joint music-text embedding model based on contrastive learning, which achieves state-of-the-art performance in the zero-shot music tagging task on the MagnaTagATune dataset. Code and checkpoints are open source at https://github.com/tencent-ailab/MuQ.
Sound, Artificial Intelligence, Computation and Language, Machine Learning, Audio and Speech Processing
### What problem does this paper attempt to address?

This paper addresses several key challenges in music understanding and representation learning:

1. **Uniqueness of the music modality**
   - Music carries not only semantic information (such as emotion and style) but also acoustic information (such as melody, chords, and tonality). Self-supervised learning methods that focus on semantics struggle to capture both kinds of information at once.
   - Schematically: \[ \text{Music} = \text{Semantic Information} + \text{Acoustic Information} \]
2. **Limitations of existing models**
   - Existing self-supervised learning (SSL) models such as MERT and MusicFM have limited performance on music tasks. MERT relies on a complex neural codec (Encodec), which is computationally expensive and requires an additional CQT reconstruction loss to capture acoustic features. MusicFM relies on a random projection quantizer (BEST-RQ), whose performance depends heavily on initialization.
   - Schematically: \[ \text{MERT} = \text{Encodec} + \text{CQT Loss} \]
   - Schematically: \[ \text{MusicFM} = \text{Random Projection Quantizer} \]
3. **Data efficiency**
   - Many existing music understanding models need large amounts of training data to perform well. For example, MERT and MusicFM are typically pre-trained on about 160,000 hours of data, whereas MuQ outperforms them with only 900 hours (0.9K) of open-source data.
4. **Cross-modal alignment**
   - Existing models that align music and text representations, such as MuLan, perform well but lack open-source code and datasets, which limits their adoption. MuQ-MuLan is a joint music-text embedding model trained with contrastive learning that reaches state-of-the-art zero-shot music tagging performance on the MagnaTagATune dataset.
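The dependence of a random projection quantizer on its initialization can be illustrated with a minimal sketch. This is not the MusicFM or BEST-RQ implementation: the class name `RandomProjectionQuantizer`, the dimensions, and the toy input frames are all assumptions made for illustration. The sketch only shows the general idea that both the projection and the codebook are drawn at random and then frozen, so the target tokens are fixed by the random seed.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


class RandomProjectionQuantizer:
    """Hypothetical sketch: frozen random projection plus frozen random codebook."""

    def __init__(self, in_dim, code_dim, codebook_size, seed):
        rng = random.Random(seed)
        # Neither the projection nor the codebook is ever trained.
        self.proj = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
                     for _ in range(code_dim)]
        self.codebook = [
            l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
            for _ in range(codebook_size)
        ]

    def tokenize(self, frame):
        """Return the index of the codebook entry nearest to the projected frame."""
        z = l2_normalize([sum(w * x for w, x in zip(row, frame))
                          for row in self.proj])
        dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in self.codebook]
        return min(range(len(dists)), key=dists.__getitem__)


# Toy "Mel frames": 5 frames with 8 bins each (made-up values).
data_rng = random.Random(0)
frames = [[data_rng.random() for _ in range(8)] for _ in range(5)]

quantizer_a = RandomProjectionQuantizer(8, 4, 16, seed=1)
quantizer_b = RandomProjectionQuantizer(8, 4, 16, seed=2)
tokens_a = [quantizer_a.tokenize(f) for f in frames]
tokens_b = [quantizer_b.tokenize(f) for f in frames]
print("seed 1 targets:", tokens_a)
print("seed 2 targets:", tokens_b)
```

Because nothing in the quantizer is learned, the prediction targets seen during pre-training are determined entirely by the seed, which is one way to read the summary's remark that performance is "highly dependent on initialization".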
### Solutions

To address these problems, the authors propose the following:

- **Mel Residual Vector Quantization (Mel-RVQ)**
  - A lightweight, single-layer linear projection structure quantizes the Mel spectrogram directly, which improves the stability and efficiency of target extraction.
  - Schematically: \[ \text{Mel-RVQ} = \text{Single Linear Layer} + \text{Residual Vector Quantization} \]
- **MuQ model**
  - MuQ is a self-supervised music representation learning model trained to predict Mel-RVQ tokens. It outperforms previous state-of-the-art (SOTA) models on multiple downstream tasks, even with only a small amount of pre-training data (900 hours).
  - Schematically: \[ \text{MuQ} = \text{Self-Supervised Learning} + \text{Mel-RVQ Targets} \]
- **MuQ-MuLan model**
  - MuQ-MuLan is a joint music-text embedding model built on MuQ. It aligns the music and text modalities through contrastive learning and reaches SOTA performance on the zero-shot music tagging task.
  - Schematically: \[ \text{MuQ-MuLan} = \text{MuQ} + \text{Contrastive Learning} + \text{Text Encoder} \]

Together, these contributions address the challenges listed above and suggest directions for future research on music understanding.
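The Mel-RVQ idea of a linear projection followed by residual vector quantization can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`nearest`, `residual_quantize`, `mel_rvq_tokens`), the sizes, and the randomly drawn projection and codebooks are hypothetical, and the training of the quantizer is omitted entirely. The sketch shows only how each codebook quantizes the residual left by the previous one, yielding one token per codebook for each frame.

```python
import random


def nearest(codebook, v):
    """Index of the codebook entry with the smallest squared distance to v."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(v, c))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(vec, codebooks):
    """Quantize vec step by step; each codebook encodes what the last one missed."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual


def mel_rvq_tokens(mel_frame, proj, codebooks):
    """Project one Mel frame with a single linear layer, then apply residual VQ."""
    latent = [sum(w * x for w, x in zip(row, mel_frame)) for row in proj]
    return residual_quantize(latent, codebooks)


# Toy setup with made-up sizes: 8 Mel bins, 4 latent dims, 3 codebooks of 16 codes.
rng = random.Random(0)
proj = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
codebooks = [[[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(16)]
             for _ in range(3)]
mel_frame = [rng.random() for _ in range(8)]

tokens, residual = mel_rvq_tokens(mel_frame, proj, codebooks)
print("tokens per codebook:", tokens)
```

In this reading, a model such as MuQ would be trained to predict these per-frame tokens from masked audio. Here the projection and codebooks are random placeholders, whereas the summary describes them as part of a dedicated quantizer.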