API Pack: A Massive Multi-Programming Language Dataset for API Call Generation

Zhen Guo, Adriana Meza Soria, Wei Sun, Yikang Shen, Rameswar Panda
2024-06-04
Abstract: We introduce API Pack, a massive multi-programming language dataset containing more than 1 million instruction-API call pairs to improve the API call generation capabilities of large language models. By fine-tuning CodeLlama-13B on 20,000 Python instances from API Pack, we enable it to outperform GPT-3.5 and GPT-4 in generating unseen API calls. Fine-tuning on API Pack also facilitates cross-programming language generalization by leveraging a large amount of data in one language and small amounts of data from other languages. Scaling the training data to 1 million instances further improves the model's ability to generalize to new APIs not used in training. To facilitate further research, we open-source the API Pack dataset, trained model, and associated source code at https://github.com/zguo0525/API-Pack.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper targets the time-consuming, tedious task of finding example code for application programming interface (API) calls during software development. Its specific goals are:

1. **Creating the API Pack Dataset**: A large-scale, multi-programming language dataset of more than 1 million instruction-API call pairs, built to improve the API call generation ability of large language models.
2. **Enhancing API Call Generation Capability**: Fine-tuning CodeLlama-13B on 20,000 Python instances from API Pack enables it to outperform GPT-3.5 and GPT-4 at generating unseen API calls.
3. **Promoting Cross-Language Generalization**: Training on a large amount of data in one language plus small amounts in other languages transfers the skill across languages, so improvements gained in one language carry over to others.
4. **Improving Generalization to New APIs**: Scaling the training data to 1 million instances further improves the model's ability to generalize to APIs not used in training.

The paper evaluates these goals experimentally and publicly releases the API Pack dataset, the trained model, and the associated source code to support further research. It also compares API Pack with existing work, describes the dataset construction process, and analyzes the experimental results.
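The cross-language setup summarized above (a large amount of data in one language, small amounts in others) can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Example` record layout, the `build_mixture` helper, the language names, and the sample sizes are all hypothetical placeholders chosen for illustration, and the real API Pack schema may differ.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """Hypothetical record layout; the real API Pack schema may differ."""
    language: str
    instruction: str
    api_call: str


def build_mixture(pool, primary, primary_n, other_n, seed=0):
    """Sample up to `primary_n` examples in the `primary` language and
    up to `other_n` examples in every other language, then shuffle."""
    rng = random.Random(seed)

    # Group the pool by programming language.
    by_lang = {}
    for ex in pool:
        by_lang.setdefault(ex.language, []).append(ex)

    mixture = []
    for lang, examples in sorted(by_lang.items()):
        quota = primary_n if lang == primary else other_n
        # Never request more examples than the language actually has.
        mixture.extend(rng.sample(examples, min(quota, len(examples))))

    rng.shuffle(mixture)
    return mixture


# Toy pool with placeholder text (not real API Pack data).
pool = [
    Example(lang, f"instruction {i}", f"api call {i}")
    for lang, count in [("python", 50), ("java", 10), ("go", 10)]
    for i in range(count)
]

mixture = build_mixture(pool, primary="python", primary_n=40, other_n=5)

# Count how many examples each language contributes to the mixture.
counts = {}
for ex in mixture:
    counts[ex.language] = counts.get(ex.language, 0) + 1
print(counts)
```

Fixing the seed keeps the sample reproducible, which makes it easier to compare runs that differ only in the per-language quotas.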