Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals

Kentaro Seki, Shinnosuke Takamichi, Norihiro Takamune, Yuki Saito, Kanami Imamura, Hiroshi Saruwatari
2024-06-26
Abstract: This paper proposes a new task called spatial voice conversion, which aims to convert a target voice while preserving spatial information and non-target signals. Traditional voice conversion methods focus on single-channel waveforms, ignoring the stereo listening experience inherent in human hearing. Our baseline approach addresses this gap by integrating blind source separation (BSS), voice conversion (VC), and spatial mixing to handle multi-channel waveforms. Through experimental evaluations, we organize and identify the key challenges inherent in this task, such as maintaining audio quality and accurately preserving spatial information. Our results highlight the fundamental difficulties in balancing these aspects, providing a benchmark for future research in spatial voice conversion. The proposed method's code is publicly available to encourage further exploration in this domain.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper proposes a new task called **Spatial Voice Conversion (SVC)**, which aims to convert target speech while preserving spatial information and non-target signals. Traditional voice conversion methods mainly focus on monaural waveforms, ignoring the stereo listening experience inherent in human hearing. The paper addresses this gap with a baseline that processes multichannel waveforms by integrating Blind Source Separation (BSS), Voice Conversion (VC), and spatial mixing.

### Main Challenges

1. **Maintaining Audio Quality**: Ensuring that the overall audio quality is not compromised while the target speech is converted.
2. **Accurate Preservation of Spatial Information**: Preserving the spatial information of the original audio, such as echoes and Direction of Arrival (DoA), during the conversion process.
3. **Retention of Non-Target Signals**: Leaving non-target signals (such as background sounds and other speakers' voices) unchanged while only the target speech is converted.

### Experimental Evaluation

Through experimental evaluation, the authors identify the key challenges in this task, in particular balancing audio quality against accurate preservation of spatial information. The results indicate that this balance is fundamentally difficult, and they provide a benchmark for future research.

### Method Overview

1. **Blind Source Separation (BSS)**: Separates the target speech and the non-target signals from the mixed audio.
2. **Voice Conversion (VC)**: Applied only to the separated target speech.
3. **Spatial Mixing**: Recombines the converted target speech and the unchanged non-target signals into a multichannel audio output.

### Conclusion

This paper introduces the task of spatial voice conversion and demonstrates its challenges through experiments. Although the experimental conditions are limited, the study finds that balancing audio quality and spatial information reproduction is a fundamental difficulty. Future research therefore needs to explore whether spatial voice conversion can achieve both high audio quality and accurate spatial information reproduction.
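
The three-stage pipeline described in the Method Overview (separation, conversion of the target only, spatial re-mixing) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes an instantaneous (non-reverberant), noise-free mixture with a known mixing matrix, whereas real BSS must estimate the mixing blindly from reverberant recordings. The names `separate`, `spatial_voice_conversion`, and `vc_model` are hypothetical, and `vc_model` stands in for any single-channel VC system that returns a waveform of the same length.

```python
import numpy as np

def separate(mix, mixing_matrix, target_idx=0):
    """Stand-in for BSS. Assumes the mixing matrix is known, which real BSS does not.

    mix:           (channels, samples) multichannel mixture
    mixing_matrix: (channels, sources) instantaneous mixing gains
    Returns the mono target estimate and the multichannel non-target image.
    """
    demix = np.linalg.pinv(mixing_matrix)            # (sources, channels)
    target_mono = demix[target_idx] @ mix            # (samples,)
    # Multichannel image of the target, i.e. its contribution to each channel.
    target_image = np.outer(mixing_matrix[:, target_idx], target_mono)
    non_target = mix - target_image                  # everything except the target
    return target_mono, non_target

def spatial_voice_conversion(mix, mixing_matrix, vc_model, target_idx=0):
    """Separate, convert only the target, then re-mix with the same spatial gains."""
    target_mono, non_target = separate(mix, mixing_matrix, target_idx)
    converted = vc_model(target_mono)                # hypothetical single-channel VC
    converted_image = np.outer(mixing_matrix[:, target_idx], converted)
    return converted_image + non_target

# Toy example: two sources recorded by two channels.
rng = np.random.default_rng(0)
sources = rng.standard_normal((2, 1000))             # row 0 is the target
A = np.array([[1.0, 0.3],
              [0.4, 1.0]])                           # assumed mixing gains
mix = A @ sources

# Polarity inversion is used as a trivial placeholder for a real VC model.
out = spatial_voice_conversion(mix, A, vc_model=lambda x: -x)
```

In this idealized setting the non-target image is recovered exactly and the converted target keeps the original inter-channel gains. With real reverberant mixtures, separation errors propagate into both the converted speech and the residual, which is the quality versus spatial-accuracy trade-off the summary describes.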