Cleared for Takeoff? Compositional & Conditional Reasoning may be the Achilles Heel to (Flight-Booking) Language Agents

Harsh Kohli, Huan Sun
2024-04-06
Abstract: The rapid progress of large language models (LLMs) has seen them excel and frequently surpass human performance on standard benchmarks. This has enabled many downstream applications, such as LLM agents, to rely on their sophisticated reasoning to navigate complex task requirements. However, LLMs are known to unexpectedly falter in simple tasks and under seemingly straightforward circumstances - underscoring the need for better and more diverse evaluation setups to measure their true capabilities. To this end, we choose to study compositional and conditional reasoning, two cornerstones of human cognition, and introduce GroundCocoa - a lexically diverse benchmark connecting these reasoning skills to the real-world problem of flight booking. Our task involves aligning detailed user preferences with available flight options presented in a multiple-choice format. Results indicate a significant disparity in performance among current state-of-the-art LLMs with even the best performing model, GPT-4 Turbo, not exceeding 67% accuracy despite advanced prompting techniques.
Computation and Language
What problem does this paper attempt to address?
This paper examines the conditional and compositional reasoning abilities of large language models (LLMs) in complex tasks such as those handled by flight-booking agents. Although LLMs perform well on standard benchmarks, frequently surpassing human performance, they can fail on seemingly simple tasks, which suggests a need for better evaluation methods to measure their true capabilities. The paper introduces a new benchmark, GroundCocoa, which connects these two reasoning skills to the real-world problem of flight booking. The task involves aligning detailed user preferences with available flight options presented in a multiple-choice format. GroundCocoa includes a controlled method for generating samples of varying complexity and is used to test current state-of-the-art LLMs. Even with advanced prompting techniques, the best-performing model, GPT-4 Turbo, does not exceed 67% accuracy. The study finds that Chain-of-Thought (CoT) prompting brings only moderate improvements in certain cases, and that LLMs struggle to handle complex steps. Additionally, uncommon user requirements reduce model performance, indicating a potential bias from pretraining. The paper also analyzes how model performance declines as the complexity of conditions and their combinations increases, and uses an entropy measure over the answer options to explain why performance can vary across queries of similar complexity. Finally, the paper points out the challenges that even state-of-the-art LLMs face in conditional reasoning and grounding tasks, emphasizing the importance of more in-depth evaluation of these models' reasoning abilities.
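To make the task concrete, the following is a minimal sketch of what matching a user requirement against multiple-choice flight options could look like. It is not the paper's data format or generation method: the `Flight` fields, the thresholds, and the example requirement are all hypothetical and chosen only to show one compositional (OR) constraint combined with one conditional (if-then) constraint.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Flight:
    # Hypothetical attributes; the benchmark's real fields may differ.
    price: float
    stops: int
    duration_hours: float
    carbon_kg: float


def implies(condition: bool, consequence: bool) -> bool:
    """Material implication: 'if condition then consequence'."""
    return (not condition) or consequence


def satisfies(flight: Flight) -> bool:
    """Hypothetical user requirement combining both reasoning types."""
    # Compositional: the flight is cheap OR it is non-stop.
    cheap_or_direct = flight.price < 500 or flight.stops == 0
    # Conditional: IF the flight is longer than 8 hours THEN emissions must be low.
    long_needs_low_carbon = implies(flight.duration_hours > 8, flight.carbon_kg < 400)
    return cheap_or_direct and long_needs_low_carbon


def matching_options(options: List[Flight], requirement: Callable[[Flight], bool]) -> List[int]:
    """Return the indices of the options that meet the requirement."""
    return [i for i, flight in enumerate(options) if requirement(flight)]


# Four made-up answer options; only the third meets every constraint.
OPTIONS = [
    Flight(price=450, stops=1, duration_hours=9, carbon_kg=520),   # long flight, high emissions
    Flight(price=700, stops=2, duration_hours=6, carbon_kg=300),   # neither cheap nor non-stop
    Flight(price=620, stops=0, duration_hours=10, carbon_kg=380),  # non-stop, long, low emissions
    Flight(price=480, stops=1, duration_hours=12, carbon_kg=410),  # long flight, emissions too high
]

MATCHES = matching_options(OPTIONS, satisfies)
```

In this sketch the distractors each fail for a different reason, which illustrates why a model has to evaluate every clause of the requirement rather than match on a single attribute.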