Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models

Borodin Kirill Nikolayevich, Kudryavtsev Vasiliy Dmitrievich, Mkrtchian Grach Maratovich, Gorodnichev Mikhail Genadievich, Korzh Dmitrii Sergeevich
2024-06-27
Abstract: One of the most crucial components of biometric security is the automatic speaker verification (ASV) system, which verifies identity from the speaker's voice. ASV systems can be used in isolation or in conjunction with other AI models. The quality and quantity of neural networks are increasing rapidly, and so is the number of systems that manipulate data through voice conversion (VC) and text-to-speech (TTS) models. Research on voice biometrics forgery is supported by a number of challenges, including SSTC, ASVSpoof, and SingFake. This paper presents a system for automatic speaker verification. The primary objective of our model is to extract embeddings from the target speaker's audio that capture important characteristics of their voice, such as pitch, energy, and the duration of phonemes. This information is used in our multivoice TTS pipeline, which is currently under development. The model was also employed in the SSTC challenge to verify users whose voice had undergone voice conversion, where it demonstrated an EER of 20.669%.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The main problem this paper addresses is how well Automatic Speaker Verification (ASV) systems perform on voices that have been processed by Voice Conversion (VC). Specifically, the paper focuses on extracting embeddings from the target speaker's audio that capture important characteristics of the voice, such as pitch, energy, and phoneme duration, and on applying this information in a multi-voice Text-to-Speech (TTS) pipeline. The paper also explores how to use the ASV system to identify the original speaker in converted speech, in particular within the SSTC challenge, which requires verification of voices that have undergone voice conversion. The approach is to build a model that both distinguishes speakers effectively and produces embeddings informative enough to predict voice features, including the duration of each phoneme; with such embeddings the paper seeks to improve the TTS system, in particular its duration predictor. The paper also discusses the model's use in detecting voice spoofing, which matters for preventing malicious activity. In the SSTC challenge, the model verified speakers after voice conversion with an Equal Error Rate (EER) of 20.669%.
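For readers unfamiliar with the metric, the following is a minimal sketch of how an EER can be computed from verification scores. It is not the authors' implementation: the summary does not say how trials are scored, so cosine similarity between speaker embeddings is an assumption here, and the function names `cosine_score` and `equal_error_rate` are hypothetical. The EER is the operating point where the false acceptance rate (impostor trials accepted) equals the false rejection rate (genuine trials rejected).

```python
import numpy as np


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings (assumed scoring rule)."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) for genuine and impostor trial scores.

    A trial is accepted when its score is >= threshold. The EER is taken
    at the candidate threshold where FAR and FRR are closest, as the
    average of the two rates (a simple approximation, no interpolation).
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)

    # Every observed score is a candidate threshold.
    thresholds = np.unique(np.concatenate([target, nontarget]))

    # FRR: share of genuine trials rejected; FAR: share of impostor trials accepted.
    frr = np.array([(target < t).mean() for t in thresholds])
    far = np.array([(nontarget >= t).mean() for t in thresholds])

    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0), float(thresholds[idx])


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    genuine = [0.9, 0.6, 0.4, 0.8]
    impostor = [0.1, 0.5, 0.3, 0.7]
    eer, thr = equal_error_rate(genuine, impostor)
    print(f"EER = {eer * 100:.3f}% at threshold {thr:.2f}")
```

Under this definition, an EER of 20.669% means that at the balancing threshold roughly one in five genuine trials is rejected and roughly one in five impostor trials is accepted.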