ML-Bench: Large Language Models Leverage Open-source Libraries for
Machine Learning Tasks
Y. Liu,Xiangru Tang,Zerong Cai,Jiangang Lu,Yichi Zhang,Yina Shao,Zexuan Deng,He Hu,Zhaoming Yang,Kaikai An,Ruilin Huang,Shuzheng Si,Sheng Chen,Haiyang Zhao,Zhengliang Li,Chao Liang,Yanxiang Zong,Qianqian Wang,Tianyu Li,Zhiwei Jiang,Baobao Chang,Yujia Qin,Wangchunshu Zhou,Yilun Zhao,Arman Cohan,Mark Gerstein
DOI: https://doi.org/10.48550/arxiv.2311.09835
2023-01-01
Abstract:Large language models have shown promising performance in code generation benchmarks. However, a considerable divide exists between these benchmark achievements and their practical applicability, primarily attributed to real-world programming's reliance on pre-existing libraries. Instead of evaluating LLMs to code from scratch, this work aims to propose a new evaluation setup where LLMs use open-source libraries to finish machine learning tasks. Therefore, we propose ML-Bench, an expansive benchmark developed to assess the effectiveness of LLMs in leveraging existing functions in open-source libraries. Consisting of 10044 samples spanning 130 tasks over 14 notable machine learning GitHub repositories. In this setting, given a specific machine learning task instruction and the accompanying README in a codebase, an LLM is tasked to generate code to accomplish the task. This necessitates the comprehension of long and language-code interleaved documents, as well as the understanding of complex cross-file code structures, introducing new challenges. Notably, while GPT-4 exhibits remarkable improvement over other LLMs, it manages to accomplish only 39.73\% of the tasks, leaving a huge space for improvement. We address these challenges by proposing ML-Agent, designed to effectively navigate the codebase, locate documentation, retrieve code, and generate executable code. Empirical results demonstrate that ML-Agent, built upon GPT-4, results in further improvements. Code, data, and models are available at \url{https://ml-bench.github.io/}.