Benchmarking Large Language Models for Bio-Image Analysis Code Generation

Robert Haase, Christian Tischer, Jean-Karim Hériché, Nico Scherf
DOI: https://doi.org/10.1101/2024.04.19.590278
2024-07-04
Abstract: In the computational age, life-scientists often have to write Python code to solve bio-image analysis (BIA) problems. Many of them have not been formally trained in programming though. Code-generation, or coding assistance in general, with Large Language Models (LLMs) can have a clear impact on BIA. To the best of our knowledge, the quality of the generated code in this domain has not been studied. We present a quantitative benchmark to estimate the capability of LLMs to generate code for solving common BIA tasks. Our benchmark currently consists of 57 human-written prompts with corresponding reference solutions in Python, and unit-tests to evaluate functional correctness of potential solutions. We demonstrate our benchmark here and compare 18 state-of-the-art LLMs. To ensure that we will cover most of our community needs we also outline mid- and long-term strategies to maintain and extend the benchmark by the BIA open-source community. This work should support users in deciding for an LLM and also guide LLM developers in improving the capabilities of LLMs in the BIA domain.
Bioinformatics
What problem does this paper attempt to address?
The paper addresses the problem of evaluating how well large language models (LLMs) generate code for bio-image analysis (BIA). Specifically, the authors propose a quantitative benchmark to estimate the ability of LLMs to generate code for common BIA tasks. The benchmark consists of 57 human-written prompts with corresponding reference solutions, plus unit tests that assess the functional correctness of candidate solutions. By comparing 18 state-of-the-art LLMs, the authors aim to support users in selecting an appropriate LLM and to guide LLM developers in improving model capabilities in the BIA domain.

### Main Objectives:
1. **Evaluate the code generation capability of LLMs**: Assess the quality of code generated by LLMs for bio-image analysis tasks through quantitative benchmarking.
2. **Provide decision support**: Help users decide which LLM to use for developing bio-image analysis scripts and tools.
3. **Guide LLM development**: Provide a metric that LLM developers can use to guide further development in the BIA domain.
4. **Community engagement**: Encourage the community to contribute new test cases so that the benchmark covers a wide range of needs.

### Methodology:
- **Benchmark dataset**: Contains 57 human-written prompts (functions with docstrings), each with a corresponding reference solution and unit tests.
- **Evaluation metrics**: Use the pass@k metric, which estimates the probability that at least one of k generated code samples passes the unit tests. The benchmark reports pass@1 specifically, i.e. the probability that a single generated sample passes.
- **Hardware and software environment**: Run the benchmark on both commercial and open-source models, across different hardware and software environments.

### Results:
- **Performance comparison**: Showcased the performance of different LLMs in the benchmark, with models such as claude-3-5-sonnet-20240620, gpt-4o-2024-05-13, and gpt-4-turbo-2024-04-09 performing well.
- **Error analysis**: Analyzed common error messages in the code generated by LLMs, revealing systematic differences between models.
- **Library usage**: Summarized the Python libraries used in the generated code, identifying trends that differ from human-written reference code.

### Discussion:
- **Bias and improvements**: Discussed potential biases introduced by the selection of test cases and proposed future improvements, including adding more real-world test cases and considering the evaluation of visual models.
- **Community-driven**: Emphasized the importance of community involvement, encouraging users to submit new test cases via Pull Requests to ensure the comprehensiveness and practicality of the benchmark.

### Conclusion:
The authors developed a benchmark to evaluate the performance of LLMs in code generation for bio-image analysis. This benchmark can help researchers select an appropriate LLM and provide guidance for LLM developers. Finally, the authors encourage active community participation to ensure the benchmark covers a wide range of domain needs.
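
The pass@k metric named under Methodology can be made concrete with a small sketch. It uses the unbiased estimator that is standard for HumanEval-style benchmarks: for a task with `n` generated samples of which `c` pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The function names, the per-task `(n, c)` input format, and the sample counts in the usage note are assumptions made for illustration; they are not taken from the paper or its code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples, c of which passed the tests."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every draw of k samples contains a pass.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples fail) = 1 - C(n - c, k) / C(n, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; results holds one (n, c) pair per task."""
    if not results:
        raise ValueError("results must contain at least one task")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to the fraction of passing samples, `c / n`, so a hypothetical task with 10 samples and 3 passes yields a pass@1 of about 0.3. A benchmark-level score is then the mean over all tasks, for example `mean_pass_at_k([(10, 5), (10, 10)], k=1)`.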