Cleared for Takeoff? Compositional & Conditional Reasoning may be the Achilles Heel to (Flight-Booking) Language Agents

Harsh Kohli, Huan Sun

2024-04-06

Abstract: The rapid progress of large language models (LLMs) has seen them excel and frequently surpass human performance on standard benchmarks. This has enabled many downstream applications, such as LLM agents, to rely on their sophisticated reasoning to navigate complex task requirements. However, LLMs are known to unexpectedly falter in simple tasks and under seemingly straightforward circumstances - underscoring the need for better and more diverse evaluation setups to measure their true capabilities. To this end, we choose to study compositional and conditional reasoning, two cornerstones of human cognition, and introduce GroundCocoa - a lexically diverse benchmark connecting these reasoning skills to the real-world problem of flight booking. Our task involves aligning detailed user preferences with available flight options presented in a multiple-choice format. Results indicate a significant disparity in performance among current state-of-the-art LLMs with even the best performing model, GPT-4 Turbo, not exceeding 67% accuracy despite advanced prompting techniques.

Computation and Language

What problem does this paper attempt to address?

This paper examines the conditional and compositional reasoning abilities of large language models (LLMs) in complex tasks such as those handled by flight-booking agents. Although LLMs perform well on standard benchmarks, often surpassing human performance, they can fail on seemingly simpler tasks, which suggests that better evaluation methods are needed to measure their true capabilities. The paper introduces a new benchmark, GroundCocoa, which connects these two reasoning skills to the real-world problem of flight booking. The task requires aligning detailed user preferences with available flight options presented in a multiple-choice format. GroundCocoa includes a controlled method for generating samples of varying complexity and is used to test current state-of-the-art LLMs. Even with advanced prompting techniques, the best-performing model, GPT-4 Turbo, does not exceed 67% accuracy.

The study finds that Chain-of-Thought (CoT) prompting brings only moderate improvements, and only in certain cases, and that LLMs struggle with requirements that involve many reasoning steps. Uncommon user requirements also reduce model performance, which points to a potential bias from pretraining. The paper further analyzes how performance declines as the complexity of the conditions and their combinations increases, and it uses entropy, as a measure of the perplexity of the answer options, to explain why performance can vary across queries of similar complexity. Finally, the paper highlights the challenges that even state-of-the-art LLMs face in conditional reasoning and grounding tasks, and it emphasizes the importance of more in-depth evaluation of these models' reasoning abilities.
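
To make the task concrete, the following minimal Python sketch shows one way a compositional, conditional user requirement could be checked against multiple-choice flight options. It is an illustration only, not GroundCocoa's generation or evaluation code: the `Flight` fields, the helpers `all_of` and `implies`, the function `matching_options`, and the sample requirement and options are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical flight record; the field names are assumptions for illustration.
@dataclass(frozen=True)
class Flight:
    price: int            # ticket price in USD
    stops: int            # number of layovers
    departure_hour: int   # local departure hour, 0-23
    has_wifi: bool

Constraint = Callable[[Flight], bool]

def all_of(*constraints: Constraint) -> Constraint:
    """Compose constraints conjunctively (compositional reasoning)."""
    return lambda f: all(c(f) for c in constraints)

def implies(condition: Constraint, consequence: Constraint) -> Constraint:
    """Conditional constraint: if `condition` holds, `consequence` must hold."""
    return lambda f: (not condition(f)) or consequence(f)

# Hypothetical user requirement:
#   - the price must not exceed 500 USD;
#   - if the flight has a layover, it must offer wifi;
#   - if it departs before 8:00, it must cost at most 300 USD.
requirement = all_of(
    lambda f: f.price <= 500,
    implies(lambda f: f.stops > 0, lambda f: f.has_wifi),
    implies(lambda f: f.departure_hour < 8, lambda f: f.price <= 300),
)

# Hypothetical multiple-choice options.
options: List[Flight] = [
    Flight(price=450, stops=1, departure_hour=10, has_wifi=False),  # layover, no wifi
    Flight(price=480, stops=0, departure_hour=9, has_wifi=False),   # meets every condition
    Flight(price=350, stops=0, departure_hour=6, has_wifi=True),    # early but over 300 USD
    Flight(price=520, stops=0, departure_hour=12, has_wifi=True),   # over budget
]

def matching_options(req: Constraint, opts: List[Flight]) -> List[int]:
    """Return the indices of the options that satisfy the requirement."""
    return [i for i, flight in enumerate(opts) if req(flight)]

print(matching_options(requirement, options))  # [1]
```

A symbolic checker like this resolves the requirement trivially; the difficulty the paper measures lies in an LLM performing the same alignment from a natural-language description of the preferences.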
