ChiSCor: A Corpus of Freely Told Fantasy Stories by Dutch Children for Computational Linguistics and Cognitive Science

Bram M.A. van Dijk, Max J. van Duijn, Suzan Verberne, Marco R. Spruit
2023-10-31
Abstract: In this resource paper we release ChiSCor, a new corpus containing 619 fantasy stories, told freely by 442 Dutch children aged 4-12. ChiSCor was compiled for studying how children render character perspectives, and unravelling language and cognition in development, with computational tools. Unlike existing resources, ChiSCor's stories were produced in natural contexts, in line with recent calls for more ecologically valid datasets. ChiSCor hosts text, audio, and annotations for character complexity and linguistic complexity. Additional metadata (e.g. education of caregivers) is available for one third of the Dutch children. ChiSCor also includes a small set of 62 English stories. This paper details how ChiSCor was compiled and shows its potential for future work with three brief case studies: i) we show that the syntactic complexity of stories is strikingly stable across children's ages; ii) we extend work on Zipfian distributions in free speech and show that ChiSCor obeys Zipf's law closely, reflecting its social context; iii) we show that even though ChiSCor is relatively small, the corpus is rich enough to train informative lemma vectors that allow us to analyse children's language use. We end with a reflection on the value of narrative datasets in computational linguistics.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address the following issues:

1. **Research on Children's Language and Cognitive Development**:
   - Existing research mostly relies on standardized tests, which may not fully reflect children's actual language and cognitive abilities in natural environments. A more ecologically valid dataset is therefore needed to study how children render character perspectives and how language and cognition unfold during development.
2. **Data Collection in Natural Contexts**:
   - Existing resources are usually collected in structured or laboratory environments and lack data from natural social settings. The paper fills this gap with ChiSCor, a corpus of freely told fantasy stories.
3. **Integration of Rich Metadata**:
   - Existing resources often include only basic information about children (such as age and gender) and lack more detailed background information (such as caregivers' education levels). ChiSCor provides this additional metadata for one third of the Dutch children, supporting a more comprehensive understanding of children's language and cognitive development.
4. **Exploration of Computational Tools for Language and Cognition**:
   - By analyzing the dataset with computational tools (such as dependency distance and word frequency distributions), researchers can gain deeper insight into the complexity and regularity of children's language use.

### Main Contributions of the Paper

1. **Release of the ChiSCor Dataset**:
   - A detailed description of how the dataset was compiled, what it contains (text, audio, and metadata), and how it is annotated for character complexity and linguistic complexity.
2. **Demonstration of ChiSCor's Potential Applications**:
   - Three brief case studies demonstrate the potential of ChiSCor at the intersection of language, cognition, and computation:
     - **Case Study 1**: Examined the syntactic complexity of children's stories and found it to be strikingly stable across ages.
     - **Case Study 2**: Extended research on Zipf's law in free speech and found that ChiSCor's word frequency distribution follows Zipf's law more closely than that of written corpora.
     - **Case Study 3**: Showed that although ChiSCor is relatively small, it is rich enough to train informative lemma vectors for analyzing children's language use.
3. **Emphasis on the Value of Narrative Datasets**:
   - Highlighted the importance of narrative datasets in computational linguistics, especially for building more ecologically valid language models.

Through these contributions, the paper provides a new data resource for research on children's language and cognitive development and demonstrates how computational tools can be used to analyze such data.
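
To make the two measures named above more concrete, here is a minimal Python sketch of mean dependency distance (Case Study 1) and a rank-frequency fit for Zipf's law (Case Study 2). It is an illustration under stated assumptions, not the authors' pipeline: the function names `mean_dependency_distance` and `zipf_exponent` are hypothetical, the parse is assumed to be given as CoNLL-U-style head positions, and the Zipf exponent is estimated with a plain least-squares fit in log-log space, which may differ from the estimator used in the paper.

```python
import math
from collections import Counter


def mean_dependency_distance(heads):
    """Average linear distance between each token and its syntactic head.

    Assumption: heads[i] is the 1-based position of the head of token i+1,
    and 0 marks the root (as in the HEAD column of CoNLL-U files).
    The root has no head, so it is left out of the average.
    """
    distances = [
        abs(position - head)
        for position, head in enumerate(heads, start=1)
        if head != 0
    ]
    if not distances:
        raise ValueError("need at least one non-root token")
    return sum(distances) / len(distances)


def zipf_exponent(tokens):
    """Estimate s in frequency ~ C / rank**s by least squares on log-log data.

    This is a simple illustrative estimator, not necessarily the one
    used for the corpus analysis.
    """
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct token types")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    # The fitted slope is negative for a Zipfian distribution; report s > 0.
    return -covariance / variance


# "The dog barks": The -> dog, dog -> barks, barks is the root.
print(mean_dependency_distance([2, 3, 0]))

# Toy token list whose type frequencies are 12, 6, 4, 3 (that is, 12 / rank).
toy_tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
print(zipf_exponent(toy_tokens))
```

On real data, the head positions would come from a dependency parser and the tokens from the transcribed stories. An exponent near 1 corresponds to the classic Zipfian shape, and a stable mean dependency distance across age groups would be in line with the finding reported in Case Study 1.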