Abstract: Deep learning has made rapid advances in modeling molecular sequencing data. Despite high performance on benchmarks, it remains unclear to what extent deep learning models learn general principles and generalize to previously unseen sequences. Benchmarks traditionally interrogate model generalizability by generating metadata-based (MB) or sequence-similarity-based (SB) train and test splits of input data before assessing model performance. Here, we show that this approach mischaracterizes model generalizability by failing to consider the full spectrum of cross-split overlap, i.e., similarity between train and test splits. We introduce SPECTRA, a spectral framework for comprehensive model evaluation. For a given model and input data, SPECTRA plots model performance as a function of decreasing cross-split overlap and reports the area under this curve as a measure of generalizability. We apply SPECTRA to 18 sequencing datasets with associated phenotypes ranging from antibiotic resistance in tuberculosis to protein-ligand binding to evaluate the generalizability of 19 state-of-the-art deep learning models, including large language models, graph neural networks, diffusion models, and convolutional neural networks. We show that SB and MB splits provide an incomplete assessment of model generalizability. With SPECTRA, we find that as cross-split overlap decreases, deep learning models consistently exhibit a reduction in performance in a task- and model-dependent manner. Although no model consistently achieved the highest performance across all tasks, we show that deep learning models can generalize to previously unseen sequences on specific tasks. SPECTRA paves the way toward a better understanding of how foundation models generalize in biology.
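The summary statistic described in the abstract (area under the curve of performance against decreasing cross-split overlap) can be illustrated with a minimal sketch. This is not the SPECTRA implementation: the function name `area_under_spectral_curve`, the idea of a numeric overlap parameter where larger values mean less cross-split overlap, and the choice of trapezoidal integration are all assumptions made for illustration, and the numbers in the usage lines are invented.

```python
def area_under_spectral_curve(overlap_params, performances):
    """Trapezoidal area under a performance-vs-overlap-parameter curve.

    overlap_params: hypothetical numeric values, one per train/test split,
        where a larger value is assumed to mean LESS cross-split overlap.
    performances: the model's score (e.g. accuracy) on each split.
    """
    if len(overlap_params) != len(performances):
        raise ValueError("need one performance value per overlap parameter")
    if len(overlap_params) < 2:
        raise ValueError("need at least two points to form a curve")

    # Sort points by overlap parameter so the curve runs left to right.
    points = sorted(zip(overlap_params, performances))

    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Area of the trapezoid between two neighbouring points.
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area


# Illustrative (made-up) numbers: performance drops as overlap decreases.
params = [0.0, 0.5, 1.0]
scores = [0.9, 0.7, 0.5]
summary = area_under_spectral_curve(params, scores)
```

Under these assumptions, a model whose performance degrades more slowly as overlap decreases gets a larger area, which is why a single area value can serve as a generalizability summary across the whole range of splits rather than at one MB or SB split.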

The Effect of Data Partitioning Strategy on Model Generalizability: A Case Study of Morphological Segmentation

On How Data Are Partitioned in Model Development and Evaluation: Confronting the Elephant in the Room to Enhance Model Generalization

Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation

An R Package to Partition Observation Data Used for Model Development and Evaluation to Achieve Model Generalizability

An Empirical Study of Factors Affecting Language-Independent Models

Robust Generalization Strategies for Morpheme Glossing in an Endangered Language Documentation Context

The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text

Mix Data or Merge Models? Optimizing for Diverse Multi-Task Learning

On the Impact of Cross-Domain Data on German Language Models

How to Split: the Effect of Word Segmentation on Gender Bias in Speech Translation

Evaluating generalizability of artificial intelligence models for molecular datasets

Why do language models perform worse for morphologically complex languages?

Mitigating Data Scarcity for Large Language Models

Understanding Compositional Data Augmentation in Typologically Diverse Morphological Inflection

On the Diversity of Synthetic Data and its Impact on Training Large Language Models

An Efficient Data Partitioning to Improve Classification Performance While Keeping Parameters Interpretable

Scaling Parameter-Constrained Language Models with Quality Data

Morphological Inflection: A Reality Check

How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse

We're Calling an Intervention: Exploring the Fundamental Hurdles in Adapting Language Models to Nonstandard Text