Abstract: Most studies of neural representations of species-specific vocalizations in non-human primates and other species have examined neural responses to vocalization tokens. One limitation of this approach is the difficulty of determining which acoustical features of a vocalization evoke neural responses. Traditional filtering techniques are often inadequate for manipulating the features of complex vocalizations. Furthermore, vocalization tokens cannot fully account for the intrinsic stochastic variations of vocalizations, which are crucial to understanding the neural codes for categorizing and discriminating vocalizations that differ along multiple feature dimensions. In this work, we took a rigorous and novel approach to the study of species-specific vocalization processing by creating parametric "virtual vocalization" models of the major call types produced by the common marmoset (Callithrix jacchus). The main findings are as follows. 1) Acoustical parameters were measured from a database of the four major call types of the common marmoset. This database was obtained from eight individuals, and for each individual we typically obtained hundreds of samples of each major call type. 2) These feature measurements were used to parameterize models defining a representative virtual vocalization of each call type for each of the eight animals, as well as an overall species-representative virtual vocalization for each call type, averaged across individuals. 3) Using the same feature measurements that were applied to the vocalization samples, we measured acoustical features of the virtual vocalizations, including features not explicitly modeled, and found the virtual vocalizations to be statistically representative of the callers and call types. 4) The accuracy of the virtual vocalizations was further confirmed by comparing neural responses to real vocalizations and synthetic virtual vocalizations, recorded from the auditory cortex of awake marmosets. We found strong agreement between the responses to token vocalizations and the responses to their synthetic counterparts. 5) We demonstrated how these virtual vocalization stimuli can be used to define the notion of vocalization "selectivity" precisely and quantitatively, by using stimuli with parameter values both within and outside the naturally occurring ranges. We also showed the potential of the virtual vocalization stimuli for studying vocalization categorization by morphing between different call types and between individual callers.
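The general idea of a parametric vocalization model and of morphing between two call types can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not list the measured parameters or the synthesis model, so the three features (`duration_s`, `f_start_hz`, `f_end_hz`), the independent-Gaussian parameter model, the linear frequency sweep, and all numeric values are hypothetical placeholders rather than the authors' method or data.

```python
import math
import random
import statistics

# Hypothetical feature order; the abstract does not list the measured parameters.
FEATURES = ("duration_s", "f_start_hz", "f_end_hz")

def fit_model(samples):
    """Per-feature mean and standard deviation from measured call samples."""
    columns = list(zip(*samples))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.stdev(c) for c in columns]
    return means, stdevs

def draw_parameters(means, stdevs, rng):
    """Draw one parameter vector, treating features as independent Gaussians
    (a simplifying assumption, not taken from the source)."""
    return [rng.gauss(m, s) for m, s in zip(means, stdevs)]

def morph(params_a, params_b, alpha):
    """Linear interpolation: alpha=0 gives params_a, alpha=1 gives params_b."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(params_a, params_b)]

def synthesize(params, sample_rate=48000):
    """Render a linear frequency sweep from (duration, start, end) parameters."""
    duration_s, f_start_hz, f_end_hz = params
    n = int(round(duration_s * sample_rate))
    phase, wave = 0.0, []
    for i in range(n):
        freq = f_start_hz + (f_end_hz - f_start_hz) * i / max(n - 1, 1)
        phase += 2.0 * math.pi * freq / sample_rate
        wave.append(math.sin(phase))
    return wave

# Placeholder numbers for illustration only, not measurements from the study.
call_a = [[0.10, 7000.0, 7400.0], [0.12, 7100.0, 7500.0], [0.11, 6900.0, 7300.0]]
call_b = [[0.05, 5000.0, 9000.0], [0.06, 5200.0, 9200.0], [0.04, 4800.0, 8800.0]]

mean_a, sd_a = fit_model(call_a)
mean_b, sd_b = fit_model(call_b)

# A stimulus halfway between the two representative parameter vectors.
halfway = morph(mean_a, mean_b, 0.5)
waveform = synthesize(halfway)

# One stochastic variant of call type A, drawn around its mean parameters.
variant = draw_parameters(mean_a, sd_a, random.Random(0))
```

In this sketch, sweeping `alpha` from 0 to 1 yields a continuum of stimuli between two call types, and scaling a parameter beyond its measured spread gives stimuli outside the sampled range, which is the kind of manipulation that fixed vocalization tokens do not allow.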

Virtual Vocalization Stimuli for Investigating Neural Representations of Species-Specific Vocalizations.
