MOSA: Music Motion with Semantic Annotation Dataset for Cross-Modal Music Processing

Yu-Fen Huang, Nikki Moran, Simon Coleman, Jon Kelly, Shun-Hwa Wei, Po-Yin Chen, Yun-Hsin Huang, Tsung-Ping Chen, Yu-Chia Kuo, Yu-Chi Wei, Chih-Hsuan Li, Da-Yu Huang, Hsuan-Kai Kao, Ting-Wei Lin, Li Su
DOI: https://doi.org/10.1109/TASLP.2024.3407529
2024-06-10
Abstract: In cross-modal music processing, translation between visual, auditory, and semantic content opens up new possibilities as well as challenges. The construction of such a transformative scheme depends upon a benchmark corpus with a comprehensive data infrastructure. In particular, the assembly of a large-scale cross-modal dataset presents major challenges. In this paper, we present the MOSA (Music mOtion with Semantic Annotation) dataset, which contains high quality 3-D motion capture data, aligned audio recordings, and note-by-note semantic annotations of pitch, beat, phrase, dynamic, articulation, and harmony for 742 professional music performances by 23 professional musicians, comprising more than 30 hours and 570 K notes of data. To our knowledge, this is the largest cross-modal music dataset with note-level annotations to date. To demonstrate the usage of the MOSA dataset, we present several innovative cross-modal music information retrieval (MIR) and musical content generation tasks, including the detection of beats, downbeats, phrase, and expressive contents from audio, video and motion data, and the generation of musicians' body motion from given music audio. The dataset and codes are available alongside this publication (https://github.com/yufenhuang/MOSA-Music-mOtion-and-Semantic-Annotation-dataset).
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses how to construct a high-quality, large-scale multimodal dataset that supports Music Information Retrieval (MIR) and music content generation tasks in cross-modal music processing. Specifically, the authors identify three main shortcomings of existing datasets:

1. **Scarcity of professional semantic annotation**: Existing cross-modal music datasets contain very little professional semantic information annotated manually note by note, which restricts their scale and range of applications.
2. **Scarcity of accurate 3-D human motion data**: High-quality 3-D motion-capture data is difficult to obtain because its collection requires a strictly controlled experimental environment and complex post-processing, so existing datasets contain only small amounts of it.
3. **Challenges in cross-modal data alignment**: Aligning data from different modalities (such as audio, video, and motion) in time is complex and time-consuming. Across different performances of the same piece, differences in tempo and expression mean that musical time units (such as beats and bars) can vary significantly in duration.

To address these problems, the authors propose the MOSA (Music mOtion with Semantic Annotation) dataset, a cross-modal music dataset containing high-quality 3-D motion-capture data, aligned audio recordings, and detailed note-by-note semantic annotations. The characteristics of the MOSA dataset are as follows:

- **Large-scale and high-quality**: The MOSA dataset contains 742 professional music performances by 23 professional musicians (pianists and violinists), with a total duration of more than 30 hours and more than 570,000 notes.
- **Rich semantic annotation**: Each note has detailed semantic annotations, including pitch, beat, phrase, dynamics, articulation, and harmony information.
- **Accurate 3-D motion capture**: A 3-D motion-capture system with 9 cameras records the musicians' body movements, and the motion data is accurately aligned with the audio data.

By constructing the MOSA dataset, the authors aim to provide a benchmark dataset for cross-modal music processing, thereby promoting the development of tasks such as Music Information Retrieval and music content generation. For example, the dataset can be used to study how to generate a musician's body movements from audio, or how to generate background music from video.

In summary, the core contribution of this paper is a large-scale, high-quality dataset that supports cross-modal music processing and addresses the deficiencies of existing datasets in semantic annotation, motion data quality, and cross-modal alignment.
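To make the cross-modal alignment problem concrete, the following is a minimal Python sketch of how a note-level annotation expressed in seconds might be mapped onto motion-capture frame indices. It is an illustration only: the `NoteAnnotation` record, its field names, the helper functions, and the frame rates used in the example are all hypothetical and are not taken from the MOSA dataset or its code.

```python
from dataclasses import dataclass


@dataclass
class NoteAnnotation:
    """Hypothetical note record; field names are assumptions, not MOSA's schema."""
    onset_sec: float   # note start time on the shared audio timeline, in seconds
    offset_sec: float  # note end time, in seconds
    pitch: int         # MIDI pitch number (assumed encoding)


def time_to_frame(t_sec: float, fps: float) -> int:
    """Convert a time in seconds to the nearest frame index at a given frame rate."""
    return int(round(t_sec * fps))


def note_frame_span(note: NoteAnnotation, fps: float) -> tuple[int, int]:
    """Return the half-open frame range [start, end) covered by a note.

    The range always spans at least one frame, so very short notes
    still map to some motion data.
    """
    start = time_to_frame(note.onset_sec, fps)
    end = max(start + 1, time_to_frame(note.offset_sec, fps))
    return start, end


if __name__ == "__main__":
    # Assumed frame rate, chosen purely for illustration.
    note = NoteAnnotation(onset_sec=0.5, offset_sec=1.0, pitch=60)
    print(note_frame_span(note, fps=120.0))
```

A sketch like this assumes all modalities already share a common time origin; establishing that shared timeline across recordings is the harder part of the alignment work described above.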