The ISCSLP 2024 Conversational Voice Clone (CoVoC) Challenge: Tasks, Results and Findings

Kangxiang Xia, Dake Guo, Jixun Yao, Liumeng Xue, Hanzhao Li, Shuai Wang, Zhao Guo, Lei Xie, Qingqing Zhang, Lei Luo, Minghui Dong, Peng Sun
2024-10-31
Abstract: The ISCSLP 2024 Conversational Voice Clone (CoVoC) Challenge aims to benchmark and advance zero-shot spontaneous style voice cloning, particularly focusing on generating spontaneous behaviors in conversational speech. The challenge comprises two tracks: an unconstrained track without limitation on data and model usage, and a constrained track only allowing the use of constrained open-source datasets. A 100-hour high-quality conversational speech dataset is also made available with the challenge. This paper details the data, tracks, submitted systems, evaluation results, and findings.
Sound, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper addresses is the **benchmarking and evaluation of zero-shot spontaneous conversational voice cloning**. Specifically, it focuses on the following aspects:

1. **Generating spontaneous behaviors in natural conversations**: Existing speech synthesis systems struggle with spontaneous conversation, especially in maintaining the naturalness and expressiveness of speech. The paper therefore aims to evaluate and improve how well systems generate natural conversational speech.
2. **Lack of consistent training and test datasets**: Large-scale zero-shot TTS systems are commonly trained and tested on different datasets, which hinders direct comparison and accurate evaluation across systems. To address this, the paper provides standardized datasets and evaluation benchmarks.
3. **Evaluating the diversity and performance of systems**: By setting up two tracks (an unconstrained track and a constrained track), the paper aims to evaluate how different methods and techniques perform in generating high-quality speech in a natural conversational style.

### Main contributions of the paper

- **Release of a high-quality dataset**: A 100-hour high-quality conversational speech dataset (HQ-Conversations) is provided to support research.
- **Design of a standardized test set**: A test set containing multiple text types is designed to make the evaluation comprehensive and fair.
- **Evaluation metrics**: Objective and subjective metrics, namely Character Error Rate (CER), Speaker Similarity (SIM), and Mean Opinion Score (MOS) along multiple dimensions, are used to evaluate system performance.
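To make the metrics named above concrete, here is a minimal Python sketch of how each one is commonly computed. It is an illustration under stated assumptions, not the challenge's evaluation code: the summary does not say which ASR model produces the transcripts or which encoder produces the embeddings, so both are assumed to be computed elsewhere. SIM is assumed here to be a cosine similarity between two embedding vectors, and all function names are hypothetical.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions, substitutions all cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: edit distance divided by reference length.

    Spaces are ignored (an assumption of this sketch).
    """
    ref = list(reference.replace(" ", ""))
    hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_opinion_score(ratings):
    """MOS: arithmetic mean of listener ratings (e.g. on a 1-5 scale)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(ratings) / len(ratings)
```

In this sketch a lower CER means the synthesized speech is more intelligible to the recognizer, while higher SIM and MOS values are better. A real evaluation would average these values over the whole test set and report MOS per rated dimension.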
### Summary

By organizing the ISCSLP 2024 Conversational Voice Clone (CoVoC) Challenge, this paper aims to promote the development of zero-shot spontaneous conversational-style speech synthesis and provides the datasets and evaluation criteria needed for that purpose. This helps the academic and industrial communities better understand and improve existing technologies, and it lays a foundation for future research.