Abstract: This study presents FruitsMusic, a metadata corpus of Japanese idol-group songs in the real world, precisely annotated with who sings what and when. Japanese idol-group songs, vital to Japanese pop culture, feature a unique vocal arrangement style, where songs are divided into several segments, and a specific individual or multiple singers are assigned to each segment. To enhance singer diarization methods for recognizing such structures, we constructed FruitsMusic as a resource using 40 music videos of Japanese idol groups from YouTube. The corpus includes detailed annotations, covering songs across various genres, division and assignment styles, and groups ranging from 4 to 9 members. FruitsMusic also facilitates the development of various music information retrieval techniques, such as lyrics transcription and singer identification, benefiting not only Japanese idol-group songs but also a wide range of songs featuring single or multiple singers from various cultures. This paper offers a comprehensive overview of FruitsMusic, including its creation methodology and unique characteristics compared to conversational speech. Additionally, this paper evaluates the efficacy of current methods for singer embedding extraction and diarization in challenging real-world conditions using FruitsMusic. Furthermore, this paper examines potential improvements in automatic diarization performance through evaluating human performance.

### What problem does this paper attempt to solve?

This paper addresses singer identification and segmentation in Japanese idol-group songs. Its main contribution is a real-world corpus named **FruitsMusic**, built to improve singer diarization. Specifically:

1. **Singer diarization problem**: Singer diarization means identifying "who is singing at what time" from the music signal. This task is crucial for understanding the structure and expression of idol-group songs. Most existing research is based on virtual idol songs from games and anime, which are relatively uniform in style and whose singers are easy to distinguish. Real-world idol-group songs are divided among singers in more complex ways, so research needs more challenging datasets.
2. **Lack of real-world datasets**: Existing research relies mainly on virtual idol songs, which differ significantly from real-world idol-group songs. To fill this gap, the authors constructed the **FruitsMusic** dataset, which contains real-world idol-group songs from YouTube and is annotated in detail with which singers sing each segment.
3. **Multimodal information processing**: Idol-group songs contain not only audio but also other modalities, such as video content and lyrics. The design of FruitsMusic therefore also takes multimodal processing, such as multimodal diarization, into account.
4. **Application of Music Information Retrieval (MIR) technology**: In addition to singer diarization, FruitsMusic can be used to develop and evaluate other MIR technologies, such as lyrics transcription, emotion classification, and singer identification. This helps improve the understanding and processing of songs with single or multiple singers across cultural and linguistic backgrounds.
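
To make the "who is singing at what time" annotation from item 1 concrete, the following is a minimal sketch of one way such segment-level labels could be represented and converted into a frame-by-singer activity grid, which is a common input form for diarization models. Everything here is an assumption made for illustration: the `Segment` class, the helper names, the singer labels, and the timings are hypothetical and are not taken from the FruitsMusic annotation format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One annotated span: the singers active between start and end (seconds)."""
    start: float
    end: float
    singers: frozenset


def to_frame_activity(segments, singer_ids, duration, frame_sec=0.1):
    """Convert segments into a frame-by-singer grid of 0/1 activity flags."""
    n_frames = round(duration / frame_sec)
    column = {singer: i for i, singer in enumerate(singer_ids)}
    grid = [[0] * len(singer_ids) for _ in range(n_frames)]
    for seg in segments:
        first = round(seg.start / frame_sec)
        last = min(round(seg.end / frame_sec), n_frames)
        for t in range(first, last):
            for singer in seg.singers:
                grid[t][column[singer]] = 1
    return grid


def overlap_ratio(grid):
    """Fraction of sung frames in which two or more singers are active."""
    sung = [row for row in grid if sum(row) > 0]
    if not sung:
        return 0.0
    return sum(1 for row in sung if sum(row) >= 2) / len(sung)


# Hypothetical 4-second excerpt with three singers (labels are made up).
SINGERS = ["A", "B", "C"]
SEGMENTS = [
    Segment(0.0, 1.0, frozenset({"A"})),            # solo line
    Segment(1.0, 2.0, frozenset({"B"})),            # solo line
    Segment(2.0, 3.0, frozenset({"A", "B", "C"})),  # unison part
]
GRID = to_frame_activity(SEGMENTS, SINGERS, duration=4.0, frame_sec=0.5)

if __name__ == "__main__":
    for index, row in enumerate(GRID):
        print(f"frame {index}: {row}")
    print("overlap ratio:", overlap_ratio(GRID))
```

Allowing several singers per segment is the relevant design choice: parts sung in unison then appear as frames with more than one active flag, which is the kind of structure that separates group songs from turn-taking conversational speech.
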
### Characteristics of the FruitsMusic dataset

- **Real-world data**: FruitsMusic contains 40 real-world idol-group songs from YouTube, covering different styles and song division methods.
- **Detailed annotation**: Each song is annotated in detail with which singers sing each segment, along with the start and end times of that segment.
- **Diversity and complexity**: The dataset covers idol groups with 4 to 9 members, which adds to the diversity and complexity of the data.
- **Wide application**: It applies not only to Japanese idol-group songs but can also be extended to songs with multiple singers from other cultural backgrounds.

### Research methods

1. **Data collection and annotation**: 40 idol-group songs were collected from YouTube, and the singers of each segment were recorded by manual annotation.
2. **Model training and evaluation**: The FruitsMusic dataset was used to train and evaluate singer diarization models, including Self-Attention End-to-End Neural Diarization (SA-EEND) and pyannote.audio.
3. **Human evaluation**: A human evaluator performed manual diarization, providing a reference point for judging the performance of automatic systems.

### Conclusion

By constructing the FruitsMusic dataset, the authors address the lack of real-world idol-group song data in existing research and provide a resource for developing singer diarization and other MIR technologies. The experimental results show that a pipeline system combining source separation and diarization performs better on complex real-world songs, and that the introduction of the FruitsMusic dataset significantly improves model performance.

FruitsMusic: A Real-World Corpus of Japanese Idol-Group Songs

Mind Band: A Crossmedia AI Music Composing Platform

JaCappella Corpus: A Japanese a Cappella Vocal Ensemble Corpus

JVS-MuSiC: Japanese multispeaker singing-voice corpus

Popular Hooks: A Multimodal Dataset of Musical Hooks for Music Understanding and Generation

The PMEmo Dataset for Music Emotion Recognition

PJS: phoneme-balanced Japanese singing voice corpus

The WASABI song corpus and knowledge graph for music lyrics analysis

A Dataset for Learning Stylistic and Cultural Correlations Between Music and Videos

FMA: A Dataset For Music Analysis

Analysis and Detection of Singing Techniques in Repertoires of J-POP Solo Singers

CNAMD Corpus: A Chinese Natural Audiovisual Multimodal Database of Conversations for Social Interactive Agents

A study on the selection of Japanese popular songs suitable for high school Japanese language teaching using text mining

A Novel Framework for Efficient Automated Singer Identification in Large Music Databases

Commodifying adolescence for performance and profit: Language and gender in Japanese idol music

Love Me, Love Me, Say (and Write!) that You Love Me: Enriching the WASABI Song Corpus with Lyrics Annotations

ChoralSynth: Synthetic Dataset of Choral Singing

Creating an A Cappella Singing Audio Dataset for Automatic Jingju Singing Evaluation Research

POP909: A Pop-song Dataset for Music Arrangement Generation

East Asian pop music idol production and the emergence of data fandom in China

MusicTM-Dataset for Joint Representation Learning among Sheet Music, Lyrics, and Musical Audio