SEAL: Suite for Evaluating API-use of LLMs

Woojeong Kim, Ashish Jagmohan, Aditya Vempaty
2024-09-24
Abstract: Large language models (LLMs) have limitations in handling tasks that require real-time access to external APIs. While several benchmarks like ToolBench and APIGen have been developed to assess LLMs' API-use capabilities, they often suffer from issues such as lack of generalizability, limited multi-step reasoning coverage, and instability due to real-time API fluctuations. In this paper, we introduce SEAL, an end-to-end testbed designed to evaluate LLMs in real-world API usage. SEAL standardizes existing benchmarks, integrates an agent system for testing API retrieval and planning, and addresses the instability of real-time APIs by introducing a GPT-4-powered API simulator with caching for deterministic evaluations. Our testbed provides a comprehensive evaluation pipeline that covers API retrieval, API calls, and final responses, offering a reliable framework for structured performance comparison in diverse real-world scenarios. SEAL is publicly available, with ongoing updates for new benchmarks.
Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses several problems that large language models (LLMs) face on tasks requiring real-time access to external APIs. Specifically, it points out the following deficiencies in current evaluation benchmarks:

1. **Lack of generalizability**: Many existing API-use benchmarks lack sufficient holdout sets, which can lead to overfitting. For example, some datasets have no clear train-test split, making it difficult to evaluate model performance.
2. **Limited multi-step reasoning coverage**: Most existing benchmarks focus on single-step queries and ignore the multi-tool, multi-step reasoning required in complex real-world scenarios. This limits how comprehensively they can evaluate LLMs on complex tasks.
3. **Instability from real-time API fluctuations**: Because API services are dynamic (service definitions change, response behaviors change, and so on), existing benchmarks struggle to provide a stable and reliable evaluation environment. This instability hampers the evaluation and standardization of new systems.
4. **Incomplete evaluation**: Existing benchmarks often cover only one part of the API-use process rather than the entire pipeline, which includes API retrieval, API calls, and the accuracy of the final response.

To address these problems, the paper introduces SEAL (Suite for Evaluating API-use of LLMs), an end-to-end testbed that aims to improve the evaluation of LLMs' API-use ability in the following ways:

- **Standardize existing benchmarks**: SEAL integrates and standardizes multiple existing API-use benchmarks so that data with different structures can be processed in a unified manner.
- **Integrate an agent system**: SEAL uses an agent system built on the AutoGen framework to test API retrieval and planning capabilities, improving flexibility and adaptability.
- **Introduce a GPT-4-powered API simulator**: To deal with the instability of real-time APIs, SEAL includes an API simulator powered by GPT-4 and combined with a caching mechanism, which yields more deterministic evaluation results.
- **Provide a complete evaluation framework**: SEAL covers every stage from API retrieval and API calls to final response generation, providing a structured performance comparison framework suited to diverse real-world scenarios.

In summary, SEAL aims to give researchers a more reliable, comprehensive, and easy-to-use tool for evaluating and improving LLMs' performance in real-world API interactions.
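
The idea of pairing an API simulator with a cache to get deterministic evaluations can be illustrated with a minimal Python sketch. This is not SEAL's implementation: the class `CachedApiSimulator`, the helper `cache_key`, and the `stub_backend` function (which stands in for a GPT-4 call) are all hypothetical names chosen for illustration. The sketch only shows why caching helps: once a simulated response has been stored for a given API name and argument set, every later evaluation run sees the same response, regardless of how the underlying generator would behave on a fresh call.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def cache_key(api_name: str, arguments: Dict[str, Any]) -> str:
    """Build a stable key so the same call always maps to the same entry.

    Sorting keys makes the key independent of argument order.
    """
    canonical = json.dumps({"api": api_name, "args": arguments}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class CachedApiSimulator:
    """Hypothetical simulator: asks a backend once per call, then replays."""

    def __init__(self, backend: Callable[[str, Dict[str, Any]], str]) -> None:
        self.backend = backend          # assumption: in practice, an LLM call
        self.cache: Dict[str, str] = {}
        self.backend_calls = 0          # counts cache misses only

    def call(self, api_name: str, arguments: Dict[str, Any]) -> str:
        key = cache_key(api_name, arguments)
        if key not in self.cache:
            # Cache miss: generate a simulated response and store it.
            self.backend_calls += 1
            self.cache[key] = self.backend(api_name, arguments)
        # Cache hit (or freshly stored): always return the stored response.
        return self.cache[key]


def stub_backend(api_name: str, arguments: Dict[str, Any]) -> str:
    """Stand-in for a model-generated API response (illustrative only)."""
    return json.dumps({"api": api_name, "echo": arguments}, sort_keys=True)


if __name__ == "__main__":
    simulator = CachedApiSimulator(stub_backend)
    first = simulator.call("get_weather", {"city": "Paris", "unit": "C"})
    second = simulator.call("get_weather", {"unit": "C", "city": "Paris"})
    print(first == second, simulator.backend_calls)
```

In a real setup the in-memory dictionary would presumably be persisted to disk so that cached responses survive across evaluation runs; that detail is an assumption here and is left out to keep the sketch short.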