Abstract

The emerging field of Genome-NLP (Natural Language Processing) aims to analyse biological sequence data using machine learning (ML), offering significant advancements in data-driven diagnostics. Three key challenges exist in Genome-NLP. First, long biomolecular sequences require “tokenisation” into smaller subunits, which is non-trivial since many biological “words” remain unknown. Second, ML methods are highly nuanced, reducing interoperability and usability. Third, comparing models and reproducing results are difficult due to the large volume and poor quality of biological data. To tackle these challenges, we developed the first automated Genome-NLP workflow that integrates feature engineering and ML techniques. The workflow is designed to be species and sequence agnostic. In this workflow: (a) we introduce a new transformer-based model for genomes called genomicBERT, which empirically tokenises sequences while retaining biological context. This approach minimises manual preprocessing, reduces vocabulary sizes, and effectively handles out-of-vocabulary “words”. (b) We enable the comparison of ML model performance even in the absence of raw data. To facilitate widespread adoption and collaboration, we have made genomicBERT available as part of the publicly accessible conda package genomeNLP. We have demonstrated the application of genomeNLP on multiple case studies, showcasing its effectiveness in the field of Genome-NLP.

Highlights

We provide a comprehensive classification of genomic data tokenisation and representation approaches for ML applications, along with their pros and cons.
We infer k-mers directly from the data and handle out-of-vocabulary words. At the same time, we achieve a significantly smaller vocabulary than the conventional k-mer approach, which drastically reduces computational complexity.
Our method is data-driven and therefore agnostic to species and biomolecule type.
We enable comparison of trained model performance without requiring the original input data, metadata or hyperparameter settings.
We present the first publicly available, high-level toolkit that infers the grammar of genomic data directly through artificial neural networks.
Preprocessing, hyperparameter sweeps, cross-validation, metrics and interactive visualisations are automated but can be adjusted by the user as needed.
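
To make the "conventional k-mer approach" mentioned above concrete, the following minimal sketch shows fixed-length k-mer tokenisation, the size of the full k-mer vocabulary over a four-letter DNA alphabet, and how k-mers unseen during vocabulary construction become out-of-vocabulary tokens. It is an illustration only: it is not genomicBERT's tokeniser and does not use the genomeNLP API, and the function names (`kmer_tokenise`, `full_kmer_vocabulary_size`, `encode`), the `<unk>` placeholder and the toy sequences are all assumptions made for this example.

```python
def kmer_tokenise(seq, k, stride=1):
    """Split a sequence into fixed-length k-mers with a sliding window."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def full_kmer_vocabulary_size(k, alphabet="ACGT"):
    """Number of distinct k-mers possible over the given alphabet."""
    return len(alphabet) ** k


def encode(tokens, vocab, unk="<unk>"):
    """Replace any token missing from the vocabulary with a placeholder."""
    return [t if t in vocab else unk for t in tokens]


# Toy "training" sequence (hypothetical) used to build a vocabulary.
train_tokens = kmer_tokenise("ACGTACGT", k=3)
vocab = set(train_tokens)

# A new sequence containing k-mers that never appeared in training.
new_tokens = kmer_tokenise("ACGTTT", k=3)
encoded = encode(new_tokens, vocab)

print(sorted(vocab))                    # k-mers observed in the toy data
print(encoded)                          # unseen k-mers become "<unk>"
print(full_kmer_vocabulary_size(6))     # full vocabulary for k = 6
```

The full vocabulary grows as 4^k for DNA, which is the growth that a data-driven tokeniser is meant to avoid; a learned vocabulary would instead keep only the subunits supported by the data.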


genomicBERT and data-free deep-learning model evaluation
