Abstract: How can deep neural networks encode information that corresponds to words in human speech into raw acoustic data? This paper proposes two neural network architectures for modeling unsupervised lexical learning from raw acoustic inputs: ciwGAN (Categorical InfoWaveGAN) and fiwGAN (Featural InfoWaveGAN). Both combine a Deep Convolutional GAN architecture for audio data (WaveGAN; <a class="link-https" data-arxiv-id="1705.07904" href="https://arxiv.org/abs/1705.07904">arXiv:1705.07904</a>) with InfoGAN (<a class="link-https" data-arxiv-id="1606.03657" href="https://arxiv.org/abs/1606.03657">arXiv:1606.03657</a>), an information-theoretic extension of GAN. The paper also proposes a new latent space structure that can model featural learning simultaneously with a higher-level classification and allows for a very low-dimensional vector representation of lexical items. Lexical learning is modeled as emergent from an architecture that forces a deep neural network to output data such that unique information is retrievable from its acoustic outputs. Networks trained on lexical items from TIMIT learn to encode unique information corresponding to lexical items in the form of categorical variables in their latent space. By manipulating these variables, the network outputs specific lexical items. The network occasionally outputs innovative lexical items that violate the training data but are linguistically interpretable and highly informative for cognitive modeling and neural network interpretability. Innovative outputs suggest that phonetic and phonological representations learned by the network can be productively recombined and directly paralleled to productivity in human speech: a fiwGAN network trained on 'suit' and 'dark' outputs innovative 'start', even though it never saw 'start' or even a [st] sequence in the training data.
We also argue that setting latent featural codes to values well beyond the training range results in almost categorical generation of prototypical lexical items and reveals the underlying values of each latent code.
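The latent space structure described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not state: a 100-dimensional latent vector, noise drawn uniformly from [-1, 1], a one-hot code for the categorical (ciwGAN-style) case, and a binary code for the featural (fiwGAN-style) case. The function names `ciwgan_latent` and `fiwgan_latent` and the `scale` parameter are hypothetical; `scale` only illustrates setting code values beyond the range used in training.

```python
import random

LATENT_DIM = 100  # assumed total latent size (not stated in the abstract)


def noise(n, rng):
    """Uniform noise in [-1, 1]; the distribution is an assumption."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def ciwgan_latent(word_index, num_words, rng, scale=1.0):
    """Categorical code: one-hot vector over words, followed by noise."""
    if not 0 <= word_index < num_words:
        raise ValueError("word_index out of range")
    code = [scale if i == word_index else 0.0 for i in range(num_words)]
    return code + noise(LATENT_DIM - num_words, rng)


def fiwgan_latent(word_index, num_bits, rng, scale=1.0):
    """Featural code: binary digits of word_index, followed by noise.

    num_bits features can distinguish up to 2 ** num_bits words, which is
    why the featural representation is lower-dimensional than one-hot.
    """
    if not 0 <= word_index < 2 ** num_bits:
        raise ValueError("word_index out of range")
    code = [scale * ((word_index >> b) & 1) for b in reversed(range(num_bits))]
    return code + noise(LATENT_DIM - num_bits, rng)


rng = random.Random(0)
z_cat = ciwgan_latent(2, 8, rng)                 # 8 code values for 8 words
z_feat = fiwgan_latent(5, 3, rng)                # 3 code values for 8 words
z_extreme = fiwgan_latent(5, 3, rng, scale=5.0)  # codes beyond training range
```

In this sketch, eight lexical items need eight code dimensions in the categorical case but only three in the featural case. A trained generator would receive vectors like `z_extreme` when probing how outputs behave with code values far outside the training range.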

Articulation GAN: Unsupervised modeling of articulatory learning

Generative Adversarial Phonology: Modeling Unsupervised Phonetic and Phonological Learning With Neural Networks

Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis

CiwaGAN: Articulatory information exchange

Modeling speech recognition and synthesis simultaneously: Encoding and decoding lexical and sublexical semantic information into speech with no direct access to speech data

Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents

Local and non-local dependency learning and emergence of rule-like representations in speech data by Deep Convolutional Generative Adversarial Networks

Disentanglement in a GAN for Unconditional Speech Synthesis

Bidirectional Generative Adversarial Representation Learning for Natural Stimulus Synthesis

A Model of Emotional Speech Generation Based on Conditional Generative Adversarial Networks

Analyzing Input and Output Representations for Speech-Driven Gesture Generation

Temporal conditional Wasserstein GANs for audio-visual affect-related ties

DeepNAG: Deep Non-Adversarial Gesture Generation

GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models

CiwGAN and fiwGAN: Encoding information in acoustic data to model lexical learning with Generative Adversarial Networks

Analysis by Adversarial Synthesis -- A Novel Approach for Speech Vocoding

Targeted Speech Adversarial Example Generation With Generative Adversarial Network

High Fidelity Speech Synthesis with Adversarial Networks

GANterpretations