Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling

Nirav Bhan, Shival Gupta, Sai Manaswini, Ritik Baba, Narun Yadav, Hillori Desai, Yash Choudhary, Aman Pawar, Sarthak Shrivastava, Sudipta Biswas
2024-10-23
Abstract: Large Language Models (LLMs) have shown remarkable capabilities in various domains, yet their economic impact has been limited by challenges in tool use and function calling. This paper introduces ThorV2, a novel architecture that significantly enhances LLMs' function calling abilities. We develop a comprehensive benchmark focused on HubSpot CRM operations to evaluate ThorV2 against leading models from OpenAI and Anthropic. Our results demonstrate that ThorV2 outperforms existing models in accuracy, reliability, latency, and cost efficiency for both single and multi-API calling tasks. We also show that ThorV2 is far more reliable and scales better to multistep tasks compared to traditional models. Our work offers the tantalizing possibility of more accurate function-calling compared to today's best-performing models using significantly smaller LLMs. These advancements have significant implications for the development of more capable AI assistants and the broader application of LLMs in real-world scenarios.
Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the weakness of large language models (LLMs) in practical applications that require precise interaction with external tools and APIs. Although LLMs perform well on many tasks, their economic impact has been limited because they are unreliable at function calling. This weakness has drawn criticism of highly anticipated AI devices such as the Rabbit R1 and the Humane AI Pin, which could not reliably complete user tasks. Moreover, although chatbots and code assistants can increase productivity, few jobs had been completely replaced by AI as of September 2024, which highlights a gap between the theoretical capabilities of LLMs and their practical performance.

To address these issues, the paper introduces ThorV2, a new architecture designed to enhance the function-calling capabilities of LLMs. ThorV2 adopts an approach called "edge-of-domain modeling", which focuses on correcting errors in LLM outputs rather than providing comprehensive instructions in advance. The paper also develops a comprehensive benchmark based on HubSpot CRM operations to evaluate ThorV2 against leading models from OpenAI and Anthropic.

The main contributions of the paper include:
- Introducing ThorV2, a new architecture for enhancing the function-calling capabilities of LLMs.
- Developing a comprehensive benchmark for evaluating function-calling performance.
- Proposing a new reliability metric that measures consistent performance across multiple tests.
- Demonstrating the superior performance of ThorV2 in terms of accuracy, reliability, latency, and cost efficiency.
- Showing that ThorV2 suffers minimal performance degradation when handling complex multi-step tasks.
Through these improvements, the paper demonstrates the possibility of achieving more accurate function calling with LLMs significantly smaller than the current best models, which matters both for developing more capable AI assistants and for applying LLMs more widely in real-world scenarios.
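This summary does not define the reliability metric, so the following is only a minimal sketch of one plausible reading: a query counts as reliable when the model answers it correctly on every repeated run. The function names, the sample queries, and the pass/fail data are all hypothetical and are not taken from the paper. The sketch contrasts that reading with plain accuracy to show why the two numbers can differ.

```python
from typing import Dict, List


def accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of all individual runs that were correct."""
    total_runs = sum(len(runs) for runs in results.values())
    correct_runs = sum(sum(runs) for runs in results.values())
    return correct_runs / total_runs


def reliability(results: Dict[str, List[bool]]) -> float:
    """Fraction of queries that were correct on every repeated run.

    Assumption: this "all runs correct" definition is illustrative only;
    the paper's actual metric may be defined differently.
    """
    consistent = sum(1 for runs in results.values() if runs and all(runs))
    return consistent / len(results)


# Hypothetical outcomes: each query is run three times (True = correct call).
results = {
    "create_contact": [True, True, True],
    "update_deal": [True, False, True],
    "search_company": [True, True, True],
    "delete_ticket": [False, False, False],
}

print(f"accuracy:    {accuracy(results):.3f}")
print(f"reliability: {reliability(results):.3f}")
```

Under this reading, a model that is right most of the time but occasionally fails on the same query scores well on accuracy and poorly on reliability, which is the kind of inconsistency a multi-test metric is meant to expose.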