RuSemShift: a dataset of historical lexical semantic change in Russian

Julia Rodina, Andrey Kutuzov
DOI: https://doi.org/10.48550/arXiv.2010.06436
2020-10-13
Abstract: We present RuSemShift, a large-scale manually annotated test set for the task of semantic change modeling in Russian for two long-term time period pairs: from the pre-Soviet through the Soviet times and from the Soviet through the post-Soviet times. Target words were annotated by multiple crowd-source workers. The annotation process was organized following the DURel framework and was based on sentence contexts extracted from the Russian National Corpus. Additionally, we report the performance of several distributional approaches on RuSemShift, achieving promising results, which at the same time leave room for other researchers to improve.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the modeling of lexical semantic change in Russian over a long historical period. Specifically, the authors created RuSemShift, a large-scale manually annotated test set for studying Russian lexical semantic change from the Tsarist (pre-Soviet) period through the Soviet period to the post-Soviet period. This fills a gap for Russian: most previous work of this kind focused on English, and large-scale manually annotated datasets for evaluating semantic change detection systems were lacking.

### Main contributions of the paper:

1. **Dataset construction**: RuSemShift is the first historical semantic change dataset for Russian annotated according to the DURel framework. The annotation was done on a large-scale crowdsourcing platform rather than relying on the personal intuition of individual researchers, which makes the dataset more objective and reliable.
2. **Time span**: The dataset covers three main time periods: the Tsarist period (1682-1916), the Soviet period (1918-1990), and the post-Soviet period (1991-2017). This long-term coverage helps capture semantic changes in the language as society changes.
3. **Annotation method**: Annotation followed the DURel framework, which quantifies the degree of semantic change by comparing the relatedness of a word's usages in different time periods. Annotators rated usage pairs on a 4-point relatedness scale whose highest level is "exactly the same", with an additional "undecidable" option, giving five possible labels in total.
4. **Performance evaluation**: The paper also reports the performance of several distributional models on RuSemShift. These models are based on static and contextualized embeddings (such as word2vec and ELMo). The results are promising but leave room for improvement.

### Main technical details:

- **Data source**: The Russian National Corpus (RNC) was used, which contains Russian texts of various genres from the mid-18th century to the early 21st century.
- **Vocabulary selection**: Words that may have undergone semantic change were selected by hand, and "filler" (or "distractor") words were randomly sampled alongside them so that system performance can be evaluated.
- **Annotation process**: Annotation was carried out on the Yandex.Toloka crowdsourcing platform. Annotator quality was ensured through several filters; for example, only native speakers of Russian over 30 years old were allowed to participate.
- **Evaluation metrics**: Krippendorff's α was used to measure agreement among annotators. Two change measures were provided: ΔLATER (the difference between the average usage relatedness within the two time periods) and COMPARE (the average relatedness of usages across the two time periods).

### Conclusion:

The RuSemShift dataset is an important resource for the study of Russian lexical semantic change. It is useful not only for academic research but also in applied fields such as sociolinguistics. With this dataset, researchers can more accurately evaluate and improve semantic change detection algorithms.
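
As an illustration of the two measures named under the evaluation metrics, here is a minimal Python sketch of how ΔLATER and COMPARE could be derived from per-pair annotator judgments. It is not the authors' code. It assumes that judgments are integers from 1 to 4, that "undecidable" is coded as 0 and excluded from averages, and that ΔLATER is the later-period mean minus the earlier-period mean. The function names and the sample judgments are hypothetical.

```python
from statistics import mean

UNDECIDABLE = 0  # assumption: code used for the "undecidable" label


def mean_relatedness(judgments):
    """Average the 1-4 relatedness judgments, skipping 'undecidable' ones."""
    valid = [j for j in judgments if j != UNDECIDABLE]
    if not valid:
        raise ValueError("no usable judgments for this word")
    return mean(valid)


def delta_later(earlier_pairs, later_pairs):
    """Mean relatedness in the later period minus that in the earlier period."""
    return mean_relatedness(later_pairs) - mean_relatedness(earlier_pairs)


def compare(cross_period_pairs):
    """Mean relatedness of usage pairs that span both periods."""
    return mean_relatedness(cross_period_pairs)


# Invented judgments for a single target word, for illustration only.
earlier_pairs = [4, 4, 3, 4, 0]  # both usages from the earlier period
later_pairs = [2, 3, 1, 2, 2]    # both usages from the later period
cross_pairs = [2, 1, 3, 2]       # one usage from each period

print("delta_later:", delta_later(earlier_pairs, later_pairs))
print("compare:", compare(cross_pairs))
```

Under these assumptions, a strongly negative ΔLATER would suggest that usages became less related to each other in the later period, and a low COMPARE would suggest that usages differ across the two periods.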