Analyzing learner language: the case of the Hebrew Learner Essay Corpus
Chen Gafni, Livnat Herzig Sheinfux, Hadar Klunover, Anat Bar Siman Tov, Anat Prior, Shuly Wintner
DOI: https://doi.org/10.1007/s10579-023-09712-w
2024-05-17
Language Resources and Evaluation
Abstract: We present the Hebrew Learner Essay Corpus (HELEECS): an annotated corpus of Hebrew-language argumentative essays authored by prospective higher-education students. The corpus includes essays by two main populations: (1) native speakers of Hebrew, whose essays were written as part of the psychometric exam that is used to assess their future success in academic studies; (2) non-native speakers of Hebrew, with three different native languages (Arabic, French, and Russian), whose essays were written as part of a language aptitude test. The corpus is uniformly encoded and stored. The non-native essays were annotated with target hypotheses (i.e., hypothesized intended formulations in standard written Hebrew). The corpus is available for research purposes upon request. We describe the corpus and the error correction and annotation schemes used in its analysis. In addition to introducing this new resource, we discuss the challenges of identifying and analyzing non-native language use. Among these challenges are determining whether the language used in a particular utterance is native-like, and determining the target hypothesis when language use is not native-like. We propose various ways of dealing with these challenges.
computer science, interdisciplinary applications