APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh Murthy, Liangwei Yang, Silvio Savarese, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, Caiming Xiong
2024-06-27
Abstract: The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize verifiable, high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scalable and structured manner. Each data point in our dataset is verified through three hierarchical stages: format checking, actual function executions, and semantic verification, ensuring its reliability and correctness. We demonstrate that models trained with our curated datasets, even with only 7B parameters, can achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models. Moreover, our 1B model achieves exceptional performance, surpassing GPT-3.5-Turbo and Claude-3 Haiku. We release a dataset containing 60,000 high-quality entries, aiming to advance research on function-calling agents. The dataset is available on Huggingface at https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k and the project homepage is https://apigen-pipeline.github.io/
Computation and Language, Artificial Intelligence, Machine Learning, Software Engineering
What problem does this paper attempt to address?
The paper addresses the challenges that large language models (LLMs) face on function-calling tasks, in particular the limited quality and diversity of current training datasets. It proposes APIGen, an automated pipeline for generating verifiable and diverse function-calling datasets, with the aim of improving LLM performance in real-world applications. Specifically, the main problems addressed by the paper are:

1. **Improving data quality**: Existing function-calling datasets often lack comprehensive validation, which can lead to inaccuracies or inefficiencies when models handle real-world application scenarios.
2. **Increasing data diversity**: For LLMs to adapt to varied APIs and application scenarios, datasets need to cover a wide range of query types and APIs.
3. **Ensuring dataset scalability**: The data generation framework should be flexible and scalable so that API data from different sources can be integrated easily.

To address these issues, the paper makes the following contributions:

- **Proposing the APIGen framework**: An automated pipeline for generating high-quality, diverse function-calling datasets. It employs a multi-stage data validation process to ensure data accuracy and applicability.
- **Developing and testing function-calling models**: The researchers used datasets generated by APIGen to train function-calling models of different scales and demonstrated their strong performance on the Berkeley Function-Calling Benchmark.
- **Releasing a synthetic dataset**: The paper publicly releases a synthetic function-calling dataset containing 60,000 high-quality entries, covering 3,673 APIs across 21 categories, to promote further research and development on function-calling agents.

In summary, the goal of this paper is to improve the performance of LLMs on function-calling tasks by providing high-quality, diverse datasets, and to empirically demonstrate the effectiveness of the proposed solutions.
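The three hierarchical verification stages named in the abstract (format checking, actual function execution, semantic verification) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `API_REGISTRY`, the toy `get_weather` function, the JSON call format, and the `keyword_judge` stand-in for semantic verification are all hypothetical names and assumptions introduced here for illustration only.

```python
import json
from typing import Any, Callable, Optional


def get_weather(city: str) -> dict:
    """Toy executable API (hypothetical; stands in for one of the collected APIs)."""
    return {"city": city, "temp_c": 21}


# Hypothetical registry mapping API names to callables.
API_REGISTRY = {"get_weather": get_weather}


def check_format(raw: str) -> Optional[list]:
    """Stage 1: require a non-empty JSON list of {"name", "arguments"} calls to known APIs."""
    try:
        calls = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(calls, list) or not calls:
        return None
    for call in calls:
        if not isinstance(call, dict):
            return None
        name = call.get("name")
        if not isinstance(name, str) or name not in API_REGISTRY:
            return None
        if not isinstance(call.get("arguments"), dict):
            return None
    return calls


def check_execution(calls: list) -> Optional[list]:
    """Stage 2: actually run every call; any exception rejects the sample."""
    results = []
    for call in calls:
        try:
            results.append(API_REGISTRY[call["name"]](**call["arguments"]))
        except Exception:
            return None
    return results


def keyword_judge(query: str, calls: list, results: list) -> bool:
    """Stage 3 stand-in (assumption): accept only if every argument value appears in the query."""
    return all(
        str(value).lower() in query.lower()
        for call in calls
        for value in call["arguments"].values()
    )


def verify(query: str, raw: str, judge: Callable[[str, list, list], bool]) -> str:
    """Run the stages in order; a sample is kept only if it passes all three."""
    calls = check_format(raw)
    if calls is None:
        return "rejected: format"
    results = check_execution(calls)
    if results is None:
        return "rejected: execution"
    if not judge(query, calls, results):
        return "rejected: semantic"
    return "accepted"
```

The stages are ordered from cheapest to most expensive, so a malformed sample is discarded before any function is executed, and a sample that fails at execution never reaches the semantic check. In a real pipeline the `judge` argument would be replaced by whatever semantic verifier is in use; the keyword match above only keeps the sketch self-contained.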