Abstract: Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation learning model for music understanding. Distinguished from previous studies adopting random projection or existing neural codec, the proposed model, named MuQ, is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ). Our Mel-RVQ utilizes residual linear projection structure for Mel spectrum quantization to enhance the stability and efficiency of target extraction and lead to better performance. Experiments in a large variety of downstream tasks demonstrate that MuQ outperforms previous self-supervised music representation models with only 0.9K hours of open-source pre-training data. Scaling up the data to over 160K hours and adopting iterative training consistently improve the model performance. To further validate the strength of our model, we present MuQ-MuLan, a joint music-text embedding model based on contrastive learning, which achieves state-of-the-art performance in the zero-shot music tagging task on the MagnaTagATune dataset. Code and checkpoints are open source at https://github.com/tencent-ailab/MuQ.

### What problem does this paper attempt to address?

The paper targets four challenges in music understanding and representation learning:

1. **The dual nature of the music modality**
   - Music carries semantic information (such as emotion and style) as well as acoustic information (such as melody, chords, and tonality). Self-supervised learning methods oriented toward semantics struggle to capture both kinds of information at once.
   - Schematically: \[ \text{Music} = \text{Semantic Information} + \text{Acoustic Information} \]
2. **Limitations of existing models**
   - Existing self-supervised learning (SSL) methods, such as MERT and MusicFM, have limited performance on music tasks. MERT relies on a complex neural codec (EnCodec), which is computationally expensive and requires an additional CQT reconstruction loss to capture acoustic features. MusicFM relies on a random projection quantizer (BEST-RQ), whose performance depends heavily on initialization.
   - Schematically: \[ \text{MERT} = \text{EnCodec} + \text{CQT Loss} \]
   - Schematically: \[ \text{MusicFM} = \text{Random Projection Quantizer} \]
3. **Data efficiency**
   - Many existing music understanding models need a large amount of pre-training data to perform well. For example, MERT and MusicFM usually need 160,000 hours of data for pre-training, while MuQ achieves better results with only 900 hours.
4. **Cross-modal alignment**
   - Existing models for aligning music and text representations, such as MuLan, perform well, but their code and datasets are not open source, which limits their adoption. MuQ-MuLan builds a joint music-text embedding model through contrastive learning.
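The contrastive alignment of music and text mentioned in the last item can be illustrated with a small sketch. The summary only says that MuQ-MuLan uses contrastive learning; the symmetric cross-entropy form, the temperature value, and every function name below are assumptions chosen for illustration, not the authors' implementation. Embeddings are assumed to be non-zero vectors.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity_matrix(music_embs, text_embs):
    """Entry [i][j] is the cosine similarity of music clip i and text j."""
    music = [l2_normalize(v) for v in music_embs]
    text = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(m, t)) for t in text] for m in music]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(music_embs, text_embs, temperature=0.1):
    """Average of music-to-text and text-to-music losses (assumed form)."""
    sims = cosine_similarity_matrix(music_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))


def zero_shot_tag(music_emb, tag_embs):
    """Pick the tag whose text embedding is closest to the music embedding."""
    names = list(tag_embs.keys())
    sims = cosine_similarity_matrix([music_emb], [tag_embs[n] for n in names])[0]
    best = max(range(len(names)), key=lambda i: sims[i])
    return names[best]
```

Once both encoders map into the same space, zero-shot tagging reduces to a nearest-neighbor lookup over tag text embeddings, which is what `zero_shot_tag` shows with hand-made vectors in place of real encoder outputs.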
### Solutions

To address these problems, the authors propose the following:

- **Mel Residual Vector Quantization (Mel-RVQ)**
  - A lightweight single-layer linear projection structure quantizes the Mel spectrogram directly, which improves the stability and efficiency of target extraction.
  - Schematically: \[ \text{Mel-RVQ} = \text{Single Linear Layer} + \text{Residual Vector Quantization} \]
- **MuQ model**
  - MuQ is a self-supervised music representation learning model trained on Mel-RVQ targets. It outperforms previous SOTA models on multiple downstream tasks, even with only a small amount of pre-training data (900 hours).
  - Schematically: \[ \text{MuQ} = \text{Self-Supervised Learning} + \text{Mel-RVQ Targets} \]
- **MuQ-MuLan model**
  - MuQ-MuLan is a joint music-text embedding model built on MuQ. It aligns the music and text modalities through contrastive learning and reaches SOTA performance in the zero-shot music tagging task.
  - Schematically: \[ \text{MuQ-MuLan} = \text{MuQ} + \text{Contrastive Learning} + \text{Text Encoder} \]

Together, these contributions address the challenges listed above and suggest directions for future research in music understanding.
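To make the Mel-RVQ idea more concrete, here is a minimal sketch of residual vector quantization applied to a linearly projected Mel frame. It follows only the schematic description above (one linear layer followed by residual quantization). The function names, the single shared projection, and the toy codebooks are assumptions for illustration, and the authors' actual layer layout and training procedure may differ.

```python
def project(frame, weights):
    """Linear projection; weights is a list of rows (out_dim x in_dim)."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def nearest_code(vector, codebook):
    """Index of the codebook entry with the smallest squared distance."""
    def squared_distance(code):
        return sum((a - b) ** 2 for a, b in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


def residual_quantize(frame, weights, codebooks):
    """Quantize one Mel frame into one token per codebook.

    Each stage quantizes what the previous stages left unexplained,
    so later tokens refine the approximation made by earlier ones.
    """
    residual = project(frame, weights)
    tokens = []
    for codebook in codebooks:
        index = nearest_code(residual, codebook)
        tokens.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return tokens, residual


# Toy example: identity projection and two hand-written 2-entry codebooks.
identity = [[1.0, 0.0], [0.0, 1.0]]
toy_codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],   # coarse stage
    [[0.5, 0.0], [0.0, 0.5]],   # finer stage applied to the residual
]
toy_tokens, toy_residual = residual_quantize([1.0, 0.4], identity, toy_codebooks)
```

In a setup like the one described, the token sequence produced for each frame would serve as the prediction target for the self-supervised model. The sketch covers target extraction only, not the masked prediction training itself.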

MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization
