Abstract: Large language models (LLMs) have limitations in handling tasks that require real-time access to external APIs. While several benchmarks like ToolBench and APIGen have been developed to assess LLMs' API-use capabilities, they often suffer from issues such as lack of generalizability, limited multi-step reasoning coverage, and instability due to real-time API fluctuations. In this paper, we introduce SEAL, an end-to-end testbed designed to evaluate LLMs in real-world API usage. SEAL standardizes existing benchmarks, integrates an agent system for testing API retrieval and planning, and addresses the instability of real-time APIs by introducing a GPT-4-powered API simulator with caching for deterministic evaluations. Our testbed provides a comprehensive evaluation pipeline that covers API retrieval, API calls, and final responses, offering a reliable framework for structured performance comparison in diverse real-world scenarios. SEAL is publicly available, with ongoing updates for new benchmarks.

What problem does this paper attempt to address?

This paper addresses several problems that large language models (LLMs) face when handling tasks that require real-time access to external APIs. Specifically, it identifies the following deficiencies in current evaluation benchmarks:

1. **Lack of generalizability**: Many existing API-use benchmarks do not have sufficient holdout sets, which may lead to overfitting. For example, some datasets have no clear train-test split, making it difficult to evaluate model performance.
2. **Limited multi-step reasoning coverage**: Most existing benchmarks focus on single-step queries and ignore the multi-tool, multi-step reasoning required in complex real-world scenarios. This limits how thoroughly they can evaluate an LLM's ability to handle complex tasks.
3. **Instability from real-time API fluctuations**: Because API services are dynamic (service definitions and response behaviors change over time), existing benchmarks struggle to provide a stable and reliable evaluation environment. This instability hampers the evaluation and standardization of new systems.
4. **Incomplete evaluation**: Existing benchmarks often focus on only one part of the API-use process rather than evaluating the entire pipeline, including API retrieval, API calls, and the accuracy of the final response.

To address these problems, the paper introduces SEAL (Suite for Evaluating API-use of LLMs), an end-to-end testbed that aims to improve the evaluation of LLMs' API-use ability in the following ways:

- **Standardizes existing benchmarks**: SEAL integrates and standardizes multiple existing API-use benchmarks so that data with different structures can be processed in a unified manner.
- **Integrates an agent system**: SEAL adopts an agent system built on the AutoGen framework to test API retrieval and planning capabilities, improving flexibility and adaptability.
- **Introduces a GPT-4-powered API simulator**: To deal with the instability of real-time APIs, SEAL provides an API simulator backed by GPT-4 and combined with a caching mechanism to achieve more deterministic evaluation results.
- **Provides a complete evaluation framework**: SEAL covers every stage from API retrieval and API calls to final response generation, providing a structured performance comparison framework suitable for a variety of practical scenarios.

In summary, SEAL aims to give researchers a more reliable, comprehensive, and easy-to-use tool for evaluating and improving LLMs' performance in real-world API interactions.
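To illustrate why caching makes a simulated API deterministic, here is a minimal sketch. It is not SEAL's implementation: the class `CachedApiSimulator`, the helper `cache_key`, and the stand-in `fake_backend` (used in place of a real GPT-4 call) are all hypothetical names introduced for this example. The idea is that a response is generated once per unique (API name, arguments) pair and replayed on every later call, so repeated evaluation runs see identical outputs even when the underlying generator is nondeterministic.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def cache_key(api_name: str, arguments: Dict[str, Any]) -> str:
    """Build a stable key: the same API name and arguments always hash identically."""
    # sort_keys=True makes the key independent of argument ordering.
    canonical = json.dumps({"api": api_name, "args": arguments}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class CachedApiSimulator:
    """Hypothetical simulator: generates a response once, then replays it from cache."""

    def __init__(self, generate: Callable[[str, Dict[str, Any]], str]) -> None:
        self._generate = generate          # assumed to wrap an LLM call in practice
        self._cache: Dict[str, str] = {}
        self.backend_calls = 0             # how often the generator was actually invoked

    def call(self, api_name: str, arguments: Dict[str, Any]) -> str:
        key = cache_key(api_name, arguments)
        if key not in self._cache:
            self.backend_calls += 1
            self._cache[key] = self._generate(api_name, arguments)
        return self._cache[key]


_sample_counter = {"n": 0}


def fake_backend(api_name: str, arguments: Dict[str, Any]) -> str:
    """Stand-in for an LLM-backed generator; deliberately differs on every invocation."""
    _sample_counter["n"] += 1
    return json.dumps({"api": api_name, "echo": arguments, "sample": _sample_counter["n"]})


simulator = CachedApiSimulator(fake_backend)
first = simulator.call("get_weather", {"city": "Paris", "unit": "C"})
# Same arguments in a different order hit the same cache entry.
second = simulator.call("get_weather", {"unit": "C", "city": "Paris"})
# Different arguments produce a new entry and a new backend call.
third = simulator.call("get_weather", {"city": "Rome", "unit": "C"})
```

In this sketch the cache lives in memory; persisting it to disk would be one way to keep results stable across separate evaluation runs, though how SEAL stores its cache is not specified here.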

SEAL: Suite for Evaluating API-use of LLMs

StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models

API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs

AppBench: Planning of Multiple APIs from Various APPs for Complex User Instruction

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Seal-Tools: Self-Instruct Tool Learning Dataset for Agent Tuning and Detailed Benchmark

ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents

A Comprehensive Framework for Evaluating API-oriented Code Generation in Large Language Models

Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios

ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Systems

Octopus: On-device language model for function calling of software APIs

BEAPI: A Tool for Bounded Exhaustive Input Generation from APIs

Semantic API Alignment: Linking High-level User Goals to APIs

Beyond Text: Unveiling Multimodal Proficiency of Large Language Models with MultiAPI Benchmark

BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models

TaskBench: Benchmarking Large Language Models for Task Automation

APILOT: Navigating Large Language Models to Generate Secure Code by Sidestepping Outdated API Pitfalls