Abstract: Large Language Models (LLMs) have shown remarkable capabilities in various domains, yet their economic impact has been limited by challenges in tool use and function calling. This paper introduces ThorV2, a novel architecture that significantly enhances LLMs' function calling abilities. We develop a comprehensive benchmark focused on HubSpot CRM operations to evaluate ThorV2 against leading models from OpenAI and Anthropic. Our results demonstrate that ThorV2 outperforms existing models in accuracy, reliability, latency, and cost efficiency for both single and multi-API calling tasks. We also show that ThorV2 is far more reliable and scales better to multistep tasks compared to traditional models. Our work offers the tantalizing possibility of more accurate function-calling compared to today's best-performing models using significantly smaller LLMs. These advancements have significant implications for the development of more capable AI assistants and the broader application of LLMs in real-world scenarios.

What problem does this paper attempt to address?

This paper addresses the limited practical capability of large language models (LLMs), particularly when they must interact with external tools and APIs. Although LLMs perform well across many tasks, their economic impact remains limited in tasks that require precise tool and API interaction. Specifically, LLMs perform poorly at function calling, which has drawn criticism of highly anticipated AI devices such as the Rabbit R1 and the Humane AI Pin for failing to complete user tasks reliably. Moreover, although chatbots and code assistants can increase productivity, as of September 2024 few jobs have been completely replaced by AI, which highlights a crucial gap between the theoretical capabilities and the practical performance of LLMs.

To address these issues, the paper introduces ThorV2, a new architecture designed to enhance the function-calling capabilities of LLMs. ThorV2 adopts a new method called "edge-of-domain modeling", which focuses on correcting errors in LLM outputs rather than providing comprehensive instructions in advance. The paper also develops a comprehensive benchmark based on HubSpot CRM operations to evaluate ThorV2 against leading models from OpenAI and Anthropic.

The main contributions of the paper include:
- Introducing ThorV2, a new architecture for enhancing the function-calling capabilities of LLMs.
- Developing a comprehensive benchmark for evaluating function-calling performance.
- Proposing a new reliability metric for measuring consistent performance across multiple tests.
- Demonstrating the superior performance of ThorV2 in terms of accuracy, reliability, latency, and cost efficiency.
- Showing the minimal performance degradation of ThorV2 when handling complex multi-step tasks.

Through these improvements, the paper demonstrates the possibility of achieving more accurate function calling with significantly smaller LLMs than the current best models, which matters for developing more capable AI assistants and for applying LLMs more widely in real-world scenarios.
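The summary mentions a reliability metric for consistent performance across repeated tests but does not give its formula. As a rough illustration only, the sketch below assumes one plausible reading: a query counts as reliable only if every repeated run answers it correctly. The function names, the query labels, and the run outcomes are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of all individual runs that were correct."""
    total = sum(len(runs) for runs in results.values())
    correct = sum(sum(runs) for runs in results.values())
    return correct / total


def reliability(results: Dict[str, List[bool]]) -> float:
    """Fraction of queries answered correctly in every repeated run.

    Assumed definition for illustration; the paper's metric may differ.
    """
    consistent = sum(1 for runs in results.values() if all(runs))
    return consistent / len(results)


# Hypothetical outcomes: each query is run four times, True = correct call.
results = {
    "create_contact": [True, True, True, True],
    "update_deal": [True, False, True, True],
    "search_company": [True, True, True, True],
    "multi_step_task": [False, False, True, False],
}

print(f"accuracy    = {accuracy(results):.2f}")
print(f"reliability = {reliability(results):.2f}")
```

Under this assumed definition, a model can score well on accuracy while scoring lower on reliability, because a single failed run disqualifies a query. That is the kind of gap a consistency-oriented metric is meant to expose.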

Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling
