QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation
Chang Li, Ruoyu Wang, Lijuan Liu, Jun Du, Yixuan Sun, Zilu Guo, Zhenrong Zhang, Yuan Jiang
2024-08-20
Abstract: In recent years, diffusion-based text-to-music (TTM) generation has gained prominence, offering an innovative approach to synthesizing musical content from textual descriptions. Achieving high accuracy and diversity in this generation process requires extensive, high-quality data, including both high-fidelity audio waveforms and detailed text descriptions, which often constitute only a small portion of available datasets. In open-source datasets, issues such as low-quality music waveforms, mislabeling, weak labeling, and unlabeled data significantly hinder the development of music generation models. To address these challenges, we propose a novel paradigm for high-quality music generation that incorporates a quality-aware training strategy, enabling generative models to discern the quality of input music waveforms during training. Leveraging the unique properties of musical signals, we first adapt and implement a masked diffusion transformer (MDT) model for the TTM task, demonstrating its distinct capacity for quality control and enhanced musicality. Additionally, we address the issue of low-quality captions in TTM with a caption refinement data processing approach. Experiments demonstrate our state-of-the-art (SOTA) performance on MusicCaps and the Song-Describer Dataset. Our demo page can be accessed at https://qa-mdt.github.io/.
Sound, Artificial Intelligence, Audio and Speech Processing