Abstract: This study proposes a language classification method based on quantitative typology. It uses a large-scale multilingual parallel corpus to obtain valid classification results by excluding the influence of covariates such as text genre and semantic content on cross-language comparison. To achieve this, we model the type-token relationship of each Slavic parallel text and calculate its lexical diversity as an approximation of the morphological complexity of the language. We then cluster the languages automatically on the basis of these lexical diversity metrics. Our findings show that (1) the lexical diversity metrics indicate where a language lies on the analytic-synthetic continuum; (2) the automatic clustering based on these metrics reflects the genealogical classification of Slavic languages; and (3) across the region where Slavic languages are spoken, lexical diversity increases monotonically from southwest to northeast, which is consistent with the pattern that previous authors have found on a global scale. The approach is data-driven, so it is independent of theoretical assumptions and easy to process computationally. It can offer better insight into corpus-based typology and may shed light on language as a human-driven complex adaptive system.
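The pipeline described in the abstract (a lexical diversity score for each parallel text, followed by automatic clustering) can be illustrated with a minimal sketch. The abstract does not name the metrics or the clustering algorithm used, so everything here is an assumption: the plain type-token ratio and a moving-average variant stand in for the lexical diversity metrics, single-linkage agglomerative clustering stands in for the clustering step, and the labels `lang_A`, `lang_B`, `lang_C` and their token sequences are invented toy data, not corpus material.

```python
from itertools import combinations


def type_token_ratio(tokens):
    """Plain TTR: number of distinct word forms divided by number of tokens."""
    return len(set(tokens)) / len(tokens)


def moving_average_ttr(tokens, window=5):
    """Mean TTR over all sliding windows of a fixed size.

    Assumed stand-in for a length-robust diversity metric; falls back to
    plain TTR when the text is not longer than the window.
    """
    if len(tokens) <= window:
        return type_token_ratio(tokens)
    scores = [
        type_token_ratio(tokens[i:i + window])
        for i in range(len(tokens) - window + 1)
    ]
    return sum(scores) / len(scores)


def cluster_1d(values, k):
    """Single-linkage agglomerative clustering of {label: score} into k groups."""
    clusters = [{name} for name in values]
    while len(clusters) > k:
        # Merge the two clusters whose closest members are nearest in score.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: min(
                abs(values[x] - values[y]) for x in pair[0] for y in pair[1]
            ),
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters


# Invented toy "parallel texts": the same content rendered with few
# inflected forms (lang_A) or with many distinct inflected forms (lang_B, lang_C).
toy_texts = {
    "lang_A": "the dog see the cat and the cat see the dog".split(),
    "lang_B": "dog-nom see-3sg cat-acc and cat-nom see-3sg dog-acc".split(),
    "lang_C": "dog-nom see-3sg.m cat-acc and cat-nom see-3sg.f dog-acc".split(),
}

scores = {name: moving_average_ttr(toks) for name, toks in toy_texts.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: {score:.3f}")
print(cluster_1d(scores, k=2))
```

Because the texts are parallel, genre and content are held constant, so differences in the score are driven mainly by how many distinct word forms each language needs to express the same material. A real study would use far longer texts and would likely cluster on several metrics at once rather than on a single score.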

Classifying Syntactic Regularities for Hundreds of Languages

From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings

Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures

Language clusters based on linguistic complex networks

Lexical Diversity As a Lens into the Classification of Slavic Languages: A Quantitative Typology Perspective

Reconstructing Native Language Typology from Foreign Language Usage

SIGTYP 2020 Shared Task: Prediction of Typological Features

Can syntactic networks indicate morphological complexity of a language?

Modeling Global Syntactic Variation in English Using Dialect Classification

Language Clustering with Word Co-Occurrence Networks Based on Parallel Texts

A Probabilistic Generative Model of Linguistic Typology

The Past, Present, and Future of Typological Databases in NLP

Quantitative Typological Analysis of Romance Languages

Stability of Syntactic Dialect Classification Over Space and Time

Preliminary lexicostatistics as a basis for language classification: a new approach

What Kind of Language Is Hard to Language-Model?

Cross-Linguistic Syntactic Evaluation of Word Prediction Models

Association Relationship Analyses Of Stylistic Syntactic Structures

Sparse Logistic Regression with High-order Features for Automatic Grammar Rule Extraction from Treebanks

Contrastive Analysis with Predictive Power: Typology Driven Estimation of Grammatical Error Distributions in ESL

From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars