Abstract: Evolutionary biologists, primarily palaeoanthropologists, anatomists and ontogenists, employ modern geometric morphometrics to quantitatively analyse physical forms (e.g., skull morphology) and explore relationships, variations, and differences between samples and taxa using landmark coordinates. The standard approach comprises two steps: Generalised Procrustes Analysis (GPA) followed by Principal Component Analysis (PCA). PCA projects the superimposed data produced by GPA onto a set of uncorrelated variables, which can be visualised on scatterplots and used to draw phenetic, evolutionary, and ontogenetic conclusions. Recently, the use of PCA in genetic studies has been challenged. Due to PCA's central role in morphometrics, we sought to evaluate the standard approach and claims based on PCA outcomes. To test PCA's accuracy, robustness, and reproducibility using benchmark data of the crania of five papionin genera, we developed MORPHIX, a Python package for processing superimposed landmark data with classifier and outlier detection methods, whose results can be further visualised using various plots. Throughout this manuscript, we address the recent and contentious use of PCA in physical anthropology and phylogenetic inference, such as the case of Homo Nesher Ramla, an archaic hominin with a questionable taxonomy. We found that PCA outcomes are artefacts of the input data and are neither reliable, robust, nor reproducible, as field members may assume them to be. We also found that supervised machine learning classifiers are more accurate both for classification and for detecting new taxa. Our findings raise concerns about PCA-based findings applied in 18,400 to 35,200 physical anthropology studies. Our work can be used to evaluate prior and novel claims concerning inter- and intra-species origins and relatedness and to improve phylogenetic and taxonomic reconstructions.
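
The two-step approach named in the abstract (GPA followed by PCA) can be illustrated with a minimal NumPy sketch. This is not the MORPHIX API and not the authors' implementation: the function names `align_to`, `gpa`, and `pca`, the iteration settings, and the synthetic landmark data are all assumptions made for illustration only.

```python
import numpy as np


def align_to(shape, reference):
    """Rotate one centred shape onto a reference (orthogonal Procrustes, no reflection)."""
    u, _, vt = np.linalg.svd(shape.T @ reference)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    if np.linalg.det(u @ vt) < 0:
        u[:, -1] *= -1
    return shape @ (u @ vt)


def gpa(shapes, n_iter=10, tol=1e-10):
    """Hypothetical GPA helper. shapes has shape (n_specimens, n_landmarks, n_dims)."""
    x = shapes - shapes.mean(axis=1, keepdims=True)          # remove translation
    x = x / np.linalg.norm(x, axis=(1, 2), keepdims=True)    # scale to unit centroid size
    mean = x[0]                                              # initial reference shape
    for _ in range(n_iter):
        x = np.stack([align_to(s, mean) for s in x])         # remove rotation
        new_mean = x.mean(axis=0)
        new_mean /= np.linalg.norm(new_mean)
        if np.linalg.norm(new_mean - mean) < tol:
            break
        mean = new_mean
    return x, mean


def pca(aligned):
    """Hypothetical PCA helper: project flattened, superimposed coordinates onto PCs."""
    flat = aligned.reshape(len(aligned), -1)
    centred = flat - flat.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    scores = u * s                          # uncorrelated PC scores, one row per specimen
    explained = s**2 / np.sum(s**2)         # fraction of variance per component
    return scores, explained


# Synthetic example (assumed data): 20 noisy copies of a 6-landmark 2D shape.
rng = np.random.default_rng(0)
base_shape = rng.normal(size=(6, 2))
specimens = base_shape + 0.05 * rng.normal(size=(20, 6, 2))
aligned_specimens, mean_shape = gpa(specimens)
pc_scores, explained_variance = pca(aligned_specimens)
```

The first two columns of `pc_scores` are what would typically be drawn on a PC1 versus PC2 scatterplot. Because the scores depend on which specimens and landmarks enter `specimens`, rerunning the sketch on different subsets is one simple way to see how the plot changes with the input.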

Principal Component Analysis as a Sanity Check for Bayesian Phylolinguistic Reconstruction

Non-Parametric Bayesian Areal Linguistics

Gaussian Tree Constraints Applied to Acoustic Linguistic Functional Data

Are Sounds Sound for Phylogenetic Reconstruction?

Principal component analysis and the locus of the Fréchet mean in the space of phylogenetic trees

Automating Sound Change Prediction for Phylogenetic Inference: A Tukanoan Case Study

Biases of Principal Component Analysis (PCA) in Physical Anthropology Studies Require a Reevaluation of Evolutionary Insights

A simple branching model that reproduces language family and language population distributions

A Phylogenetic Model of the Evolution of Discrete Matrices for the Joint Inference of Lexical and Phonological Language Histories

Bayesian Inference on Principal Component Analysis Using Reversible Jump Markov Chain Monte Carlo

Principal Component Analyses in Anthropological Genetics

Bayesian Modeling of Language-Evoked Event-Related Potentials

Principal components variable importance reconstruction (PC-VIR): Exploring predictive importance in multicollinear acoustic speech data

Reanalyzing L2 Preposition Learning with Bayesian Mixed Effects and a Pretrained Language Model

Markov Chain Monte-Carlo Phylogenetic Inference Construction in Computational Historical Linguistics

A Probabilistic Generative Model of Linguistic Typology

Phylogenetics of Indo-European Language families via an Algebro-Geometric Analysis of their Syntactic Structures

Quantitative methods for Phylogenetic Inference in Historical Linguistics: An experimental case study of South Central Dravidian

Progress on Constructing Phylogenetic Networks for Languages

Approach to the Correlation Discovery of Chinese Linguistic Parameters Based on Bayesian Method

A Variational Approach to Bayesian Phylogenetic Inference