Abstract: To enhance large language models (LLMs) for chemistry problem solving, several LLM-based agents augmented with tools have been proposed, such as ChemCrow and Coscientist. However, their evaluations are narrow in scope, leaving a large gap in understanding the benefits of tools across diverse chemistry tasks. To bridge this gap, we develop ChemAgent, an enhanced chemistry agent over ChemCrow, and conduct a comprehensive evaluation of its performance on both specialized chemistry tasks and general chemistry questions. Surprisingly, ChemAgent does not consistently outperform its base LLMs without tools. Our error analysis with a chemistry expert suggests that: For specialized chemistry tasks, such as synthesis prediction, we should augment agents with specialized tools; however, for general chemistry questions like those in exams, agents' ability to reason correctly with chemistry knowledge matters more, and tool augmentation does not always help.

What problem does this paper attempt to address?

This paper evaluates how tool-augmented large language model (LLM) agents perform on chemistry problem solving, and in particular whether such agents consistently outperform their base LLMs across different types of chemistry tasks. Specifically:

1. **Narrow evaluation scope**: Existing tool-augmented agents such as ChemCrow and Coscientist have shown promise, but their evaluations concentrate on a few specific tasks, so their actual performance across diverse chemistry tasks remains poorly understood.
2. **Understanding the effect of tool augmentation**: The study examines how tool augmentation affects an LLM's ability to solve chemistry problems, especially how the effect differs between specialized tasks (such as synthesis prediction) and general questions (such as exam questions).

To answer these questions, the authors developed ChemAgent, an enhanced chemistry agent built on ChemCrow, and evaluated it on both specialized chemistry tasks and general chemistry questions. The study found that:

- For specialized tasks, such as those related to molecules and reaction centers, tool augmentation significantly improves performance.
- For general questions, tool augmentation does not always improve performance and is sometimes even inferior to the base LLM.

Through detailed error analysis, the study points out that on general questions agents are prone to minor errors in the reasoning process, which may be caused by the additional cognitive burden introduced by tool use or by inconsistency between tool outputs and the model's internal knowledge.

In summary, the core objective of this paper is to reveal the advantages and limitations of tool-augmented agents across different chemistry tasks through systematic evaluation and error analysis, providing guidance for future research.

Tooling or Not Tooling? The Impact of Tools on Language Agents for Chemistry Problem Solving

Augmenting large language models with chemistry tools

ChemCrow: Augmenting large-language models with chemistry tools

SciAgent: Tool-augmented Language Models for Scientific Reasoning

Learning to Use Tools via Cooperative and Interactive Agents

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

CACTUS: Chemistry Agent Connecting Tool-Usage to Science

A multi-agent-driven robotic AI chemist enabling autonomous chemical research on demand

A Review of Large Language Models and Autonomous Agents in Chemistry

What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

What Are Tools Anyway? A Survey from the Language Model Perspective

LLM With Tools: A Survey

EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction

MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use

Chemist-X: Large Language Model-empowered Agent for Reaction Condition Recommendation in Chemical Synthesis

GTA: A Benchmark for General Tool Agents

Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation

AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls

MetaTool: Facilitating Large Language Models to Master Tools with Meta-task Augmentation

Small LLMs Are Weak Tool Learners: A Multi-LLM Agent