Abstract: In cross-modal music processing, translation between visual, auditory, and semantic content opens up new possibilities as well as challenges. The construction of such a transformative scheme depends upon a benchmark corpus with a comprehensive data infrastructure. In particular, the assembly of a large-scale cross-modal dataset presents major challenges. In this paper, we present the MOSA (Music mOtion with Semantic Annotation) dataset, which contains high quality 3-D motion capture data, aligned audio recordings, and note-by-note semantic annotations of pitch, beat, phrase, dynamic, articulation, and harmony for 742 professional music performances by 23 professional musicians, comprising more than 30 hours and 570 K notes of data. To our knowledge, this is the largest cross-modal music dataset with note-level annotations to date. To demonstrate the usage of the MOSA dataset, we present several innovative cross-modal music information retrieval (MIR) and musical content generation tasks, including the detection of beats, downbeats, phrase, and expressive contents from audio, video and motion data, and the generation of musicians' body motion from given music audio. The dataset and codes are available alongside this publication (https://github.com/yufenhuang/MOSA-Music-mOtion-and-Semantic-Annotation-dataset).

What problem does this paper attempt to address?

The paper addresses how to construct a high-quality, large-scale multimodal dataset that supports music information retrieval (MIR) and music content generation tasks in cross-modal music processing. The authors identify three main problems in existing datasets:

1. **Scarcity of professional semantic annotation**: Existing cross-modal music datasets contain very little professional, note-by-note manual semantic annotation, which restricts their scale and range of applications.
2. **Scarcity of accurate 3-D human motion data**: High-quality 3-D motion-capture data is difficult to obtain because its collection requires a strictly controlled experimental environment and complex post-processing, so existing datasets contain only small amounts of it.
3. **Challenges in cross-modal data alignment**: Aligning data from different modalities (such as audio, video, and motion) in time is complex and time-consuming. This is especially true across different performance versions, where differences in rhythm and expression cause time units (such as beats and bars) to vary significantly.

To address these problems, the authors propose the MOSA (Music mOtion with Semantic Annotation) dataset, a cross-modal music dataset containing high-quality 3-D motion-capture data, audio recordings, and detailed note-by-note semantic annotations. Its characteristics are as follows:

- **Large-scale and high-quality**: The MOSA dataset contains 742 professional music performances by 23 professional musicians (pianists and violinists), with a total duration of more than 30 hours and more than 570,000 notes.
- **Rich semantic annotation**: Each note has detailed semantic annotations, including pitch, beat, phrase, dynamics, articulation, and harmony information.
- **Accurate 3-D motion capture**: A 3-D motion-capture system with 9 cameras records the musicians' body movements, which are accurately aligned with the audio data.

By constructing the MOSA dataset, the authors aim to provide a benchmark dataset for cross-modal music processing, thereby promoting the development of tasks such as music information retrieval and music content generation. For example, the dataset can be used to study how to generate a musician's body movements from audio or how to generate background music from video.

In summary, the paper's core contribution is a large-scale, high-quality dataset that supports cross-modal music processing and addresses the deficiencies of existing datasets in semantic annotation, motion data quality, and cross-modal alignment.
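
To make the idea of note-level annotation aligned with motion data more concrete, the following is a minimal sketch of how one annotated note might be represented and mapped onto motion-capture frames. It is only an illustration under stated assumptions: the class `NoteAnnotation`, its field names and value types, the function `frames_for_note`, and the 120 fps frame rate in the usage note are all hypothetical and are not taken from the MOSA dataset or its code.

```python
import math
from dataclasses import dataclass


@dataclass
class NoteAnnotation:
    """One note carrying the six annotation types listed in the abstract.

    All field names and value types here are hypothetical; they are not
    the actual schema of the MOSA dataset.
    """
    onset_sec: float    # note start time in the audio recording (assumed)
    offset_sec: float   # note end time in the audio recording (assumed)
    pitch: int          # assumed to be a MIDI note number
    beat: float         # assumed beat position within the bar
    phrase: int         # assumed phrase index
    dynamic: str        # e.g. "p", "mf", "f"
    articulation: str   # e.g. "legato", "staccato"
    harmony: str        # e.g. a chord label such as "I" or "V7"


def frames_for_note(note: NoteAnnotation, fps: float, n_frames: int) -> range:
    """Return the motion-capture frame indices that overlap a note.

    Assumes audio and motion share a common time origin and that motion
    is sampled at a constant rate of `fps` frames per second.
    """
    first = math.floor(note.onset_sec * fps)
    last = math.ceil(note.offset_sec * fps)
    # Clip to the valid frame range of the motion recording.
    first = min(max(first, 0), n_frames)
    last = min(max(last, first), n_frames)
    return range(first, last)
```

Under these assumptions, a note lasting from 0.5 s to 1.0 s in a recording with motion sampled at 120 fps would map to frames 60 through 119. Real recordings would also need the audio and motion streams to be synchronized first, which is the alignment challenge described above.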

MOSA: Music Motion with Semantic Annotation Dataset for Cross-Modal Music Processing

Popular Hooks: A Multimodal Dataset of Musical Hooks for Music Understanding and Generation

MusicTM-Dataset for Joint Representation Learning among Sheet Music, Lyrics, and Musical Audio

The PMEmo Dataset for Music Emotion Recognition

SingMOS: An extensive Open-Source Singing Voice Dataset for MOS Prediction

EMOPIA: A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-based Music Generation

Audio Matters Too! Enhancing Markerless Motion Capture with Audio Signals for String Performance Capture

PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing

Video Background Music Generation: Dataset, Method and Evaluation

MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence

Unaligned Supervision For Automatic Music Transcription in The Wild

Toward a More Complete OMR Solution

MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos

CCOM-HuQin: an Annotated Multimodal Chinese Fiddle Performance Dataset

In Search of a Dataset for Handwritten Optical Music Recognition: Introducing MUSCIMA++

PIAST: A Multimodal Piano Dataset with Audio, Symbolic and Text

MusicScore: A Dataset for Music Score Modeling and Generation

ComMU: Dataset for Combinatorial Music Generation

MoMusic: A Motion-Driven Human-AI Collaborative Music Composition and Performing System

MuChin: A Chinese Colloquial Description Benchmark for Evaluating Language Models in the Field of Music