Abstract: The study aims to identify (1) morphological predictors of complexity and (2) domain-specific markers that differentiate subject areas of academic texts in Russian. The corpus comprises biology and social studies textbooks at three levels of complexity, corresponding to grades 6-7, 8-9, and 10-11 of the Russian school, and totals 941,963 tokens. The linguistic complexity of the texts was assessed using the Flesch-Kincaid readability formula modified for the Russian language, and the interdependence of the parameters was measured by correlation analysis conducted in STATISTICA. The values of the linguistic parameters, including the distribution of nouns, adjectives, and verbs and the readability index, were calculated using RuLingva (rulex.kpfu.ru/), a text profiler for the Russian language, while the frequency metrics of deverbatives and deadjectives were identified manually by the contributors. To ensure comparability of the metrics, the distributional analysis of deverbatives and deadjectives was performed on the corpus normalized to 10,000 tokens. The metrics “noun distribution”, “lexical density”, “deverbation”, and “deadjectivation” showed a linear relationship with the readability index and as such can be viewed as complexity predictors. An inverse correlation was found between the readability index and verb distribution. Morphological analysis confirmed the highly nominal character of the texts and a steady growth in the frequency of substantives. The latter is reflected in the increasing frequency of deverbation and deadjectivation suffixes in texts from the 6th to the 11th grade. The metrics of lexical density, adjective distribution, and substantive suffixes are able to discriminate between the domains of academic texts. The research findings are applicable in text analytics, computational linguistics, and genre studies, and can be useful for test developers and textbook writers.
The authors see the study of compounds of Latin and Greek origin in academic texts as a direction for further research. The identified parameters may be used as predictors of linguistic complexity and as domain discriminants.
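The two quantitative steps mentioned in the abstract, normalizing suffix counts to 10,000 tokens and correlating a metric with the readability index, can be illustrated with a minimal Python sketch. It is not the authors' procedure (they used RuLingva, manual annotation, and STATISTICA). The function names `per_10k` and `pearson_r` are hypothetical, the use of Pearson's coefficient is an assumption (the abstract does not name the coefficient), and the input numbers are invented placeholders, not values from the study.

```python
from math import sqrt


def per_10k(count, total_tokens, base=10_000):
    """Normalize a raw count to a rate per `base` tokens (hypothetical helper)."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return count * base / total_tokens


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumption: the abstract only says 'correlation analysis';
    Pearson's r is used here purely for illustration.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented placeholder data: raw suffix counts and token totals for three
# hypothetical texts, plus a hypothetical readability index for each.
suffix_counts = [120, 300, 540]
token_totals = [40_000, 60_000, 90_000]
readability_index = [6.0, 8.0, 9.0]

# Rates per 10,000 tokens make texts of different length comparable.
suffix_rates = [per_10k(c, t) for c, t in zip(suffix_counts, token_totals)]
r = pearson_r(suffix_rates, readability_index)
print(suffix_rates, round(r, 3))
```

A coefficient close to +1 would correspond to the kind of linear relationship the abstract reports for nominal metrics, and a negative coefficient to the inverse relationship reported for verb distribution. With only a handful of data points, such a coefficient is descriptive rather than a significance test.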