Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? -- A computational investigation
Khazar Khorrami,Okko Räsänen
DOI: https://doi.org/10.34842/w3vw-s845
2021-09-29
Abstract:Decades of research has studied how language learning infants learn to
discriminate speech sounds, segment words, and associate words with their
meanings. While gradual development of such capabilities is unquestionable, the
exact nature of these skills and the underlying mental representations yet
remains unclear. In parallel, computational studies have shown that basic
comprehension of speech can be achieved by statistical learning between speech
and concurrent referentially ambiguous visual input. These models can operate
without prior linguistic knowledge such as representations of linguistic units,
and without learning mechanisms specifically targeted at such units. This has
raised the question of to what extent knowledge of linguistic units, such as
phone(me)s, syllables, and words, could actually emerge as latent
representations supporting the translation between speech and representations
in other modalities, and without the units being proximal learning targets for
the learner. In this study, we formulate this idea as the so-called latent
language hypothesis (LLH), connecting linguistic representation learning to
general predictive processing within and across sensory modalities. We review
the extent that the audiovisual aspect of LLH is supported by the existing
computational studies. We then explore LLH further in extensive learning
simulations with different neural network models for audiovisual
cross-situational learning, and comparing learning from both synthetic and real
speech data. We investigate whether the latent representations learned by the
networks reflect phonetic, syllabic, or lexical structure of input speech by
utilizing an array of complementary evaluation metrics related to linguistic
selectivity and temporal characteristics of the representations. As a result,
we find that representations associated...
Artificial Intelligence,Machine Learning,Computer Vision and Pattern Recognition,Audio and Speech Processing,Computation and Language,Sound