Abstract: How learning affects vocalizations is a key question in the study of animal communication and human language. Parallel efforts in birds and humans have taught us much about how vocal learning works on a behavioural and neurobiological level. Subsequent efforts have revealed a variety of cases among mammals in which experience also has a major influence on vocal repertoires. Janik and Slater (Anim. Behav. 60, 1-11. (doi:10.1006/anbe.2000.1410)) introduced the distinction between vocal usage and production learning, providing a general framework to categorize how different types of learning influence vocalizations. This idea was built on by Petkov and Jarvis (Front. Evol. Neurosci. 4, 12. (doi:10.3389/fnevo.2012.00012)) to emphasize a more continuous distribution between limited and more complex vocal production learners. Yet, with more studies providing empirical data, the limits of the initial frameworks become apparent. We build on these frameworks to refine the categorization of vocal learning in light of advances made since their publication and widespread agreement that vocal learning is not a binary trait. We propose a novel classification system, based on the definitions by Janik and Slater, that deconstructs vocal learning into key dimensions to aid in understanding the mechanisms involved in this complex behaviour. We consider how vocalizations can change without learning, and a usage learning framework that considers context specificity and timing. We identify dimensions of vocal production learning, including the copying of auditory models (convergence/divergence on model sounds, accuracy of copying), the degree of change (type and breadth of learning) and timing (when learning takes place, the length of time it takes and how long it is retained). We consider grey areas of classification and current mechanistic understanding of these behaviours.
Our framework identifies research needs and will help to inform neurobiological and evolutionary studies endeavouring to uncover the multi-dimensional nature of vocal learning. This article is part of the theme issue 'Vocal learning in animals and humans'.

Artificial Vocal Learning Guided by Speech Recognition: What It May Tell Us about How Children Learn to Speak

A model of infant speech perception and learning

Learn2Sing: Target Speaker Singing Voice Synthesis by learning from a Singing Teacher

Evaluating Features and Metrics for High-Quality Simulation of Early Vocal Learning of Vowels

Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings

Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications

Voice Synthesis Improvement by Machine Learning of Natural Prosody

Learning Singing From Speech

Automatic recognition of child speech for robotic applications in noisy environments

The multi-dimensional nature of vocal learning

Learning to Produce Syllabic Speech Sounds via Reward-Modulated Neural Plasticity

Learning Model-Based F0 Production Through Goal-Directed Babbling

Employing deep learning model to evaluate speech information in acoustic simulations of Cochlear implants

Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab and Convolutional Recurrent Neural Networks

Real-time Synthesis of Imagined Speech Processes from Minimally Invasive Recordings of Neural Activity

Employing Deep Learning Model to Evaluate Speech Information in Acoustic Simulations of Auditory Implants

Learning to imitate facial expressions through sound

Visualising Model Training via Vowel Space for Text-To-Speech Systems

Modeling early phonetic acquisition from child-centered audio data

Statistical Learning in Speech: A Biologically Based Predictive Learning Model

Unsupervised Inference of Physiologically Meaningful Articulatory Trajectories with VocalTractLab