Abstract: As statistical approaches are increasingly used in linguistics, attention must be paid to the choice of methods and algorithms. This is especially true because these methods require assumptions to be satisfied to provide valid results, and because scientific articles still often fail to report whether such assumptions are met. Progress is nevertheless being made in various directions, one of them being the introduction of techniques able to model data that cannot be properly analyzed with simpler linear regression models. We report on recent advances in statistical modeling in linguistics. We first describe linear mixed-effects regression models (LMM), which address grouping of observations, and generalized linear mixed-effects models (GLMM), which offer a family of distributions for the dependent variable. Generalized additive models (GAM) are then introduced, which allow modeling non-linear parametric or non-parametric relationships between the dependent variable and the predictors. We then highlight the possibilities offered by generalized additive models for location, scale, and shape (GAMLSS). We explain how they make it possible to go beyond common distributions, such as Gaussian or Poisson, and offer the appropriate inferential framework to account for 'difficult' variables such as count data with strong overdispersion. We also demonstrate how they offer interesting perspectives on data when not only the mean of the dependent variable is modeled, but also its variance, skewness, and kurtosis. As an illustration, the case of phonemic inventory size is analyzed throughout the article. For over 1,500 languages, we consider as predictors the number of speakers, the distance from Africa, an estimation of the intensity of language contact, and linguistic relationships. We discuss the use of random effects to account for genealogical relationships, the choice of appropriate distributions to model count data, and non-linear relationships. Relying on GAMLSS, we assess a range of candidate distributions, including the Sichel, Delaporte, Box-Cox Green and Cole, and Box-Cox t distributions. We find that the Box-Cox t distribution, with appropriate modeling of its parameters, best fits the conditional distribution of phonemic inventory size. We finally discuss the specificities of phoneme counts, weak effects, and how GAMLSS should be considered for other linguistic variables.
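
As a rough illustration of why strongly overdispersed count data call for distributions beyond the Poisson, the sketch below compares a Poisson fit with a negative binomial fit on a small set of counts. It is not the article's analysis: the counts are invented for illustration, the negative binomial stands in for the richer GAMLSS candidates named above, its size parameter is estimated by the method of moments rather than maximum likelihood (so its AIC is only approximate), and the names `counts`, `poisson_loglik`, and `negbin_loglik` are hypothetical.

```python
import math
import statistics

# Hypothetical counts, invented for illustration (NOT data from the article).
counts = [11, 18, 22, 25, 25, 28, 30, 31, 34, 36, 40, 45, 52, 67, 92, 141]


def poisson_loglik(ys, mu):
    """Log-likelihood of counts ys under a Poisson distribution with mean mu."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in ys)


def negbin_loglik(ys, mu, size):
    """Log-likelihood under a negative binomial with mean mu and size
    (dispersion) parameter `size`; variance is mu + mu**2 / size."""
    total = 0.0
    for y in ys:
        total += (
            math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * math.log(size / (size + mu))
            + y * math.log(mu / (size + mu))
        )
    return total


mu = statistics.mean(counts)
var = statistics.variance(counts)  # sample variance

# Dispersion index: about 1 for Poisson data, well above 1 when overdispersed.
dispersion = var / mu

# Method-of-moments estimate of the size parameter (assumes var > mu).
size = mu ** 2 / (var - mu)

# AIC = 2k - 2 log L, with k = 1 (Poisson) and k = 2 (negative binomial).
aic_poisson = 2 * 1 - 2 * poisson_loglik(counts, mu)
aic_negbin = 2 * 2 - 2 * negbin_loglik(counts, mu, size)

print(f"dispersion index: {dispersion:.1f}")
print(f"AIC Poisson: {aic_poisson:.1f}  AIC negative binomial: {aic_negbin:.1f}")
```

A dispersion index far above 1 together with a lower AIC for the two-parameter model is the kind of evidence that motivates moving to more flexible distributions; a GAMLSS analysis would additionally let the scale and shape parameters depend on predictors.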

Evaluating generalised additive mixed modelling strategies for dynamic speech analysis

Generalised additive mixed models for dynamic analysis in linguistics: a practical introduction

Linear mixed effects models for non‐Gaussian continuous repeated measurement data

Modeling Linguistic Variables With Regression Models: Addressing Non-Gaussian Distributions, Non-independent Observations, and Non-linear Predictors With Random Effects and Generalized Additive Models for Location, Scale, and Shape

Generalized additive models with flexible response functions

Stable and Efficient Multiple Smoothing Parameter Estimation for Generalized Additive Models

Discriminative Dynamic Gaussian Mixture Selection with Enhanced Robustness and Performance for Multi-Accent Speech Recognition

Everything, altogether, all at once: Addressing data challenges when measuring speech intelligibility through entropy scores

Using Generalized Gaussian Distributions to Improve Regression Error Modeling for Deep Learning-Based Speech Enhancement

An overview of mixture modelling for latent evolutions in longitudinal data: Modelling approaches, fit statistics and software

GMM-HMM Acoustic Model Training by a Two Level Procedure with Gaussian Components Determined by Automatic Model Selection

An introduction to modeling longitudinal data with generalized additive models: Applications to single-case designs

Online Generalized Additive Model

Bayesian Semiparametric Longitudinal Drift-Diffusion Mixed Models for Tone Learning in Adults

Machine-learning applied to classify flow-induced sound parameters from simulated human voice

Fast Automatic Smoothing for Generalized Additive Models

Error Modeling Via Asymmetric Laplace Distribution for Deep Neural Network Based Single-Channel Speech Enhancement

Growth Mixture Modeling With Nonnormal Distributions: Implications for Data Transformation

How to analyze linguistic change using mixed models, Growth Curve Analysis and Generalized Additive Modeling

Statistical modelling of COVID-19 data: Putting generalized additive models to work