Abstract: The study aims to identify (1) morphological predictors of complexity and (2) domain-inherent markers that differentiate subject areas of academic texts in Russian. The corpus comprises textbooks on biology and social studies at three levels of complexity, corresponding to grades 6-7, 8-9, and 10-11 of the Russian school, and totals 941,963 tokens. The linguistic complexity of the texts was assessed using the Flesch-Kincaid readability formula modified for the Russian language, and the interdependence of the parameters was measured through correlation analysis conducted in STATISTICA. The values of the linguistic parameters, including the distribution of nouns, adjectives, and verbs and the readability index, were calculated using RuLingva (rulex.kpfu.ru/), a text profiler for the Russian language, while the frequencies of deverbatives and deadjectives were identified manually by the contributors. To ensure comparability of the metrics, the distributional analysis of deverbatives and deadjectives was performed on counts normalized per 10,000 tokens. The metrics “noun distribution”, “lexical density”, “deverbation”, and “deadjectivation” demonstrated a linear interdependence with readability and as such can be viewed as complexity predictors. An inverse correlation was revealed between text readability and verb distribution. Morphological analysis confirmed a high level of nominativity in the texts and a stable growth in the frequency of substantives. The latter manifests itself as an increase in the frequency of deverbation and deadjectivation suffixes in texts from the 6th to the 11th grade. The metrics of lexical density, adjective distribution, and substantive suffixes are able to discriminate between the domains of academic texts. The research findings are applicable in text analytics, computational linguistics, and genre studies, and can be useful for test developers and textbook writers.
The authors see prospects for further research in the study of compounds of Latin and Greek origin in academic texts. The identified parameters may be used as linguistic complexity predictors and domain discriminants.
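The normalization and correlation steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline, which relied on RuLingva and STATISTICA. The function names `per_10k` and `pearson` are hypothetical, Pearson's coefficient is assumed as the correlation measure, and the sample numbers are invented for illustration only.

```python
from math import sqrt


def per_10k(count: int, total_tokens: int) -> float:
    """Relative frequency of a feature per 10,000 tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return count * 10_000 / total_tokens


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented sample: (raw deverbative count, text size in tokens, readability index).
# These values are placeholders, not data from the study.
sample = [(310, 52_000, 6.1), (520, 61_000, 8.4), (790, 70_000, 10.9)]

normalized = [per_10k(count, size) for count, size, _ in sample]
readability = [index for _, _, index in sample]
r = pearson(normalized, readability)
```

Normalizing before correlating matters because the textbooks differ in length: raw suffix counts would grow with text size alone, whereas per-10,000-token frequencies can be compared across grades and subject areas.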
