API Pack: A Massive Multi-Programming Language Dataset for API Call Generation

Zhen Guo, Adriana Meza Soria, Wei Sun, Yikang Shen, Rameswar Panda

2024-06-04

Abstract: We introduce API Pack, a massive multi-programming language dataset containing more than 1 million instruction-API call pairs to improve the API call generation capabilities of large language models. By fine-tuning CodeLlama-13B on 20,000 Python instances from API Pack, we enable it to outperform GPT-3.5 and GPT-4 in generating unseen API calls. Fine-tuning on API Pack also facilitates cross-programming language generalization by leveraging a large amount of data in one language and small amounts of data from other languages. Scaling the training data to 1 million instances further improves the model's ability to generalize to new APIs not used in training. To facilitate further research, we open-source the API Pack dataset, trained model, and associated source code at https://github.com/zguo0525/API-Pack.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses the time-consuming and tedious task of finding application programming interface (API) call code examples during software development. Its goals are:

1. **Creating the API Pack Dataset**: A large-scale multi-programming language dataset containing over 1 million instruction-API call pairs, aimed at improving the API call generation ability of large language models.
2. **Enhancing API Call Generation Capability**: Fine-tuning CodeLlama-13B on 20,000 Python instances from API Pack enables it to outperform GPT-3.5 and GPT-4 in generating unseen API calls.
3. **Promoting Cross-Language Generalization**: Combining a large amount of data in one language with small amounts of data from other languages achieves cross-language skill transfer, meaning improvements gained in one language carry over to others.
4. **Improving Generalization to New APIs**: Scaling the training data to 1 million instances further improves the model's ability to generalize to new APIs not used in training.

The paper evaluates these goals experimentally and publicly releases the API Pack dataset, trained model, and associated source code to support further research. It also compares API Pack with existing work, describes the dataset construction process, and analyzes the experimental results.
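To make the cross-language setup concrete, the following is a minimal sketch of how an instruction-API call pair and a mixed-language training sample could be represented. The `InstructionApiPair` fields, the `build_mixture` helper, the example URL, and all counts are illustrative assumptions; they are not the API Pack schema or the authors' sampling procedure.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionApiPair:
    """Hypothetical record layout; the real dataset schema may differ."""
    instruction: str  # natural-language request
    api_call: str     # target code that performs the API call
    language: str     # programming language of api_call


def build_mixture(pool, main_language, main_count, other_count, seed=0):
    """Sample many pairs in one language and a few in every other language."""
    rng = random.Random(seed)
    by_language = {}
    for pair in pool:
        by_language.setdefault(pair.language, []).append(pair)

    mixture = []
    for language, pairs in sorted(by_language.items()):
        quota = main_count if language == main_language else other_count
        # Never request more pairs than the pool holds for this language.
        mixture.extend(rng.sample(pairs, min(quota, len(pairs))))
    rng.shuffle(mixture)
    return mixture


# Synthetic pool: 50 Python pairs plus 10 each for Java and Go (made-up data).
pool = [
    InstructionApiPair(
        instruction=f"List items on page {i}",
        api_call=f'requests.get("https://api.example.com/items?page={i}")',
        language="python",
    )
    for i in range(50)
]
for other in ("java", "go"):
    pool += [
        InstructionApiPair(f"List items on page {i}", f"<{other} call {i}>", other)
        for i in range(10)
    ]

mixture = build_mixture(pool, main_language="python", main_count=20, other_count=3)
counts = {}
for pair in mixture:
    counts[pair.language] = counts.get(pair.language, 0) + 1
print(counts)
```

The fixed seed keeps the sample reproducible, and the per-language quota mirrors the "large amount in one language, small amounts in others" idea from the abstract at toy scale.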

APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs

Gorilla: Large Language Model Connected with Massive APIs

Beyond Text: Unveiling Multimodal Proficiency of Large Language Models with MultiAPI Benchmark

OctoPack: Instruction Tuning Code Large Language Models

A Comprehensive Framework for Evaluating API-oriented Code Generation in Large Language Models

GPTZoo: A Large-scale Dataset of GPTs for the Research Community

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

Code Generation for Collectible Card Games with Complex APIs

AppBench: Planning of Multiple APIs from Various APPs for Complex User Instruction

ToolCoder: Teach Code Generation Models to use API search tools

The Stack: 3 TB of permissively licensed source code

RestGPT: Connecting Large Language Models with Real-World RESTful APIs

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Harnessing LLMs for API Interactions: A Framework for Classification and Synthetic Data Generation

ToolACE: Winning the Points of LLM Function Calling

ExploraCoder: Advancing code generation for multiple unseen APIs via planning and chained exploration

Are Human Rules Necessary? Generating Reusable APIs with CoT Reasoning and In-Context Learning

ComPile: A Large IR Dataset from Production Sources

Deep API Learning