Abstract:

Summary: Pretrained large language models (LLMs) have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate LLMs in generating bioinformatics-specific code. BioCoder spans much of the field, covering cross-file dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate various models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned one model (StarCoder), demonstrating that our training dataset can enhance the performance on our testing benchmark (by >15% in terms of Pass@K under certain prompt configurations and always >3%). The results highlight two key aspects of successful models: (i) Successful models accommodate a long prompt (>2600 tokens) with full context, including functional dependencies. (ii) They contain domain-specific knowledge of bioinformatics, beyond just general coding capability. This is evident from the performance gain of GPT-3.5/4 compared to the smaller models on our benchmark (50% versus up to 25%).

Availability and implementation: All datasets, benchmark, Docker images, and scripts required for testing are available at: https://github.com/gersteinlab/biocoder and https://biocoder-benchmark.github.io/.
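The abstract reports results as Pass@K without defining the metric. As a minimal sketch, assuming the commonly used unbiased estimator from the code-generation literature (the abstract does not confirm which estimator BioCoder uses), Pass@K can be computed for one problem from n generated samples of which c pass all tests. The function name `pass_at_k` and its parameters are hypothetical and chosen here for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem as 1 - C(n - c, k) / C(n, k).

    n: total samples generated for the problem
    c: number of those samples that passed all tests
    k: sample budget being evaluated (1 <= k <= n)

    Assumption: this is the standard unbiased estimator, not a formula
    quoted from the abstract.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k failing samples, every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative numbers only, not results from the benchmark.
    print(pass_at_k(n=20, c=5, k=1))
    print(pass_at_k(n=20, c=5, k=10))
```

A benchmark-level score would typically be the mean of this value over all problems; the per-problem counts would come from running each generated function against its test cases.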

OmniGenBench: Automating Large-scale in-silico Benchmarking for Genomic Foundation Models

GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models

ProteinInvBench: Benchmarking Protein Inverse Folding on Diverse Tasks, Models, and Metrics.

Diverse Genomic Embedding Benchmark for functional evaluation across the tree of life

GUANinE v1.0: Benchmark Datasets for Genomic AI Sequence-to-Function Models

3DGenBench: a web-server to benchmark computational models for 3D Genomics

ProteinBench: A Holistic Evaluation of Protein Foundation Models

Genomics-FM: Universal Foundation Model for Versatile and Data-Efficient Functional Genomic Analysis

Codabench: Flexible, Easy-to-use, and Reproducible Meta-Benchmark Platform

GenoTEX: A Benchmark for Evaluating LLM-Based Exploration of Gene Expression Data in Alignment with Bioinformaticians

BEND: Benchmarking DNA Language Models on biologically meaningful tasks

Benchmarking DNA Foundation Models for Genomic Sequence Classification

BioCoder: a benchmark for bioinformatics code generation with large language models

BioLLM: A Standardized Framework for Integrating and Benchmarking Single-Cell Foundation Models

Omnibenchmark (alpha) for continuous and open benchmarking in bioinformatics

BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data

DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome

ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction

COMET: Benchmark for Comprehensive Biological Multi-omics Evaluation Tasks and Language Models