Abstract: In the computational age, life scientists often have to write Python code to solve bio-image analysis (BIA) problems, although many of them have not been formally trained in programming. Code generation, or coding assistance in general, with Large Language Models (LLMs) can have a clear impact on BIA. To the best of our knowledge, the quality of the generated code in this domain has not been studied. We present a quantitative benchmark to estimate the capability of LLMs to generate code for solving common BIA tasks. Our benchmark currently consists of 57 human-written prompts with corresponding reference solutions in Python, and unit tests to evaluate the functional correctness of potential solutions. We demonstrate our benchmark here and compare 18 state-of-the-art LLMs. To ensure that we cover most of our community's needs, we also outline mid- and long-term strategies for the BIA open-source community to maintain and extend the benchmark. This work should support users in choosing an LLM and also guide LLM developers in improving the capabilities of LLMs in the BIA domain.

What problem does this paper attempt to address?

The paper attempts to address the problem of evaluating the performance of large language models (LLMs) in code generation for biological image analysis (BIA). Specifically, the authors propose a quantitative benchmarking method to estimate the ability of LLMs to generate code for common BIA tasks. This benchmark includes 57 human-written prompts and their corresponding reference solutions, and provides unit tests to assess the functional correctness of potential solutions. By comparing 18 state-of-the-art LLMs, the authors aim to support users in selecting the appropriate LLM and guide LLM developers in improving their capabilities in the BIA domain.

### Main Objectives:

1. **Evaluate the code generation capability of LLMs**: Assess the quality of code generated by LLMs for biological image analysis tasks through quantitative benchmarking.
2. **Provide decision support**: Help users decide which LLM to use for developing biological image analysis scripts and tools.
3. **Guide LLM development**: Provide a metric for LLM developers to guide their further development in the BIA domain.
4. **Community engagement**: Encourage the community to contribute new test cases to ensure the benchmark covers a wide range of needs.

### Methodology:

- **Benchmark dataset**: Contains 57 human-written functions and their docstrings, each with a corresponding reference solution and unit tests.
- **Evaluation metrics**: Uses the pass@k metric (specifically pass@1), the probability that at least one of k generated code samples passes the unit tests; for pass@1, this is the probability that a single generated sample passes.
- **Hardware and software environment**: Runs tests in different hardware and software environments, covering both commercial and open-source models.

### Results:

- **Performance comparison**: Shows the performance of different LLMs in the benchmark, with some models such as claude-3-5-sonnet-20240620, gpt-4o-2024-05-13, and gpt-4-turbo-2024-04-09 performing well.
- **Error analysis**: Analyzes common error messages in the code generated by LLMs, revealing systematic differences between models.
- **Library usage**: Summarizes the Python libraries used in the generated code, identifying trends that differ from human-written reference code.

### Discussion:

- **Bias and improvements**: Discusses potential biases introduced by the selection of test cases and proposes future improvements, including adding more real-world test cases and considering the evaluation of visual models.
- **Community-driven**: Emphasizes the importance of community involvement, encouraging users to submit new test cases via Pull Requests to ensure the comprehensiveness and practicality of the benchmark.

### Conclusion:

The authors developed a benchmark to evaluate the performance of LLMs in code generation for biological image analysis. This benchmark can help researchers select the appropriate LLM and provide guidance for LLM developers. Finally, the authors encourage active community participation to ensure the benchmark covers a wide range of domain needs.
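
The pass@k metric named under Methodology is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n code samples are generated per task and c of them pass the unit tests. The following is a minimal sketch of that estimator, not the authors' implementation; the function name `pass_at_k` and the sample counts in the usage lines are assumptions chosen for illustration, and the summary does not state how many samples per task the paper used.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task.

    n: number of code samples generated for the task (assumed value)
    c: number of those samples that passed all unit tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples failed, every draw of k samples
    # contains at least one passing sample.
    if n - c < k:
        return 1.0
    # Probability that a random draw of k samples contains no passing one,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 10 samples for one task, 3 of which pass.
single_attempt = pass_at_k(n=10, c=3, k=1)  # equals the pass fraction c / n
five_attempts = pass_at_k(n=10, c=3, k=5)
```

For k = 1 the estimator reduces to the fraction of passing samples, c / n, which is why pass@1 can be read as the chance that a single generated solution works. A benchmark score is then typically the mean of this value over all tasks.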

Benchmarking Large Language Models for Bio-Image Analysis Code Generation

BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models

Benchmarking Large Language Models in Evidence-Based Medicine

ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

An Evaluation of Large Language Models in Bioinformatics Research

EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories

Benchmarking Biomedical Relation Knowledge in Large Language Models

From Code to Play: Benchmarking Program Search for Games Using Large Language Models

CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

VerilogEval: Evaluating Large Language Models for Verilog Code Generation

Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models

Large Language Model Benchmarks in Medical Tasks

Evaluating Large Language Models in Class-Level Code Generation

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code