A systematic investigation of learnability from single child linguistic input

Yulu Qin, Wentao Wang, Brenden M. Lake
2024-05-11
Abstract: Language models (LMs) have demonstrated remarkable proficiency in generating linguistically coherent text, sparking discussions about their relevance to understanding human language learnability. However, a significant gap exists between the training data for these models and the linguistic input a child receives. LMs are typically trained on data that is orders of magnitude larger and fundamentally different from child-directed speech (Warstadt and Bowman, 2022; Warstadt et al., 2023; Frank, 2023a). Addressing this discrepancy, our research focuses on training LMs on subsets of a single child's linguistic input. Previously, Wang, Vong, Kim, and Lake (2023) found that LMs trained in this setting can form syntactic and semantic word clusters and develop sensitivity to certain linguistic phenomena, but they only considered LSTMs and simpler neural networks trained from just one single-child dataset. Here, to examine the robustness of learnability from single-child input, we systematically train six different model architectures on five datasets (3 single-child and 2 baselines). We find that the models trained on single-child datasets showed consistent results that matched with previous work, underscoring the robustness of forming meaningful syntactic and semantic representations from a subset of a child's linguistic input.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper investigates whether language models (LMs) can learn from the linguistic input received by a single child. LMs are typically trained on data that is orders of magnitude larger than, and fundamentally different from, the child-directed speech a child actually hears. The paper starts from the observation that children are efficient language learners, yet the mechanisms behind their language acquisition remain poorly understood. Transformer-based large language models (LLMs) excel at generating coherent text, which has sparked discussion about whether they can inform our understanding of human language learning. Earlier work in the single-child setting (Wang, Vong, Kim, and Lake, 2023) considered only LSTMs and simpler neural networks trained on one single-child dataset; this paper tests whether those findings are robust.
To do so, the authors systematically train six model architectures on five datasets (three single-child datasets and two baseline datasets). The results indicate that, regardless of the variation in model architecture or single-child dataset, the models form syntactic and semantic word categories consistent with those reported in the earlier work, which demonstrates the robustness of this learning ability. The paper uses several evaluation methods, including linguistic acceptability tests, word-vector visualization, and cloze tests, to examine the models' performance in different settings. All models consistently distinguish nouns, verbs, and other word categories, and show sensitivity to certain linguistic phenomena. However, they still struggle with more complex phenomena such as subject-verb agreement.
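As an illustration of how a linguistic acceptability test of the kind mentioned above is commonly scored, the following minimal sketch compares the log-probability a model assigns to a grammatical sentence with that of a minimally different ungrammatical one. Everything here is an assumption made for illustration: the add-one-smoothed bigram model stands in for the neural architectures in the paper, and `TOY_CORPUS`, `TOY_PAIRS`, and the function names are hypothetical. It is not the authors' evaluation code, and its result says nothing about the paper's findings.

```python
import math
from collections import Counter

# Hypothetical toy data; the paper's single-child datasets are not reproduced here.
TOY_CORPUS = [
    "the dog runs",
    "the dogs run",
    "the cat runs",
    "the cats run",
]
# Each pair is (grammatical sentence, minimally different ungrammatical sentence).
TOY_PAIRS = [
    ("the dog runs", "the dog run"),
    ("the cats run", "the cats runs"),
]


def tokenize(sentence):
    """Lowercase, split on whitespace, and add sentence boundary markers."""
    return ["<s>"] + sentence.lower().split() + ["</s>"]


def train_bigram(sentences):
    """Count context tokens and bigrams; a stand-in for training a neural LM."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = tokenize(sentence)
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams, vocab


def log_prob(sentence, model):
    """Sentence log-probability under the bigram model with add-one smoothing."""
    contexts, bigrams, vocab = model
    tokens = tokenize(sentence)
    vocab_size = len(vocab)
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
        for prev, cur in zip(tokens[:-1], tokens[1:])
    )


def minimal_pair_accuracy(pairs, model):
    """Fraction of pairs where the grammatical sentence gets the higher score."""
    correct = sum(log_prob(good, model) > log_prob(bad, model) for good, bad in pairs)
    return correct / len(pairs)


model = train_bigram(TOY_CORPUS)
accuracy = minimal_pair_accuracy(TOY_PAIRS, model)
print(accuracy)
```

On this toy data the verb is adjacent to its subject, so even a bigram model ranks both pairs correctly. Agreement across intervening words is a harder case, and swapping `log_prob` for a neural model's sentence score would leave the scoring procedure unchanged.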
In summary, the paper asks whether language models can learn meaningful syntactic and semantic structures solely from the linguistic input of a single child, and whether this ability is robust across model architectures and datasets. The findings indicate that, despite the remaining difficulties, models can form such representations from a subset of a child's linguistic input.