Abstract: Large Language Models (LLMs) demonstrate promising capabilities in solving simple scientific problems but often produce hallucinations for complex ones. While integrating LLMs with tools can increase reliability, this approach typically results in over-reliance on tools, diminishing the model's ability to solve simple problems through basic reasoning. In contrast, human experts first assess problem complexity using domain knowledge before choosing an appropriate solution approach. Inspired by this human problem-solving process, we propose a novel two-component fine-tuning method. In the first component, World Knowledge Distillation (WKD), LLMs learn directly from solutions generated using tool information to internalize domain knowledge. In the second component, Tool Usage Adaptation (TUA), we partition problems into easy and hard categories based on the model's direct answering accuracy. While maintaining the same alignment target for easy problems as in WKD, we train the model to intelligently switch to tool usage for more challenging problems. We validate our method on six scientific benchmark datasets, spanning mathematics, climate science, and epidemiology. On average, our models demonstrate a 28.18% improvement in answer accuracy and a 13.89% increase in tool usage precision across all datasets, surpassing state-of-the-art models including GPT-4o and Claude-3.5.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the hallucinations that large language models (LLMs) produce on complex scientific problems, and asks how external tools can improve reliability and accuracy without eroding the model's basic reasoning ability. Specifically, the paper proposes a two-component fine-tuning method with the following goals:

1. **Reduce Hallucination**: LLMs perform well on simple scientific problems but tend to produce incorrect or unreasonable answers (i.e., "hallucinations") on complex ones. The proposed method aims to reduce such hallucinations.
2. **Balance Tool Usage and Basic Reasoning**: Traditional methods of integrating LLMs with external tools often lead to over-reliance on the tools, which weakens the model's basic reasoning ability on simple problems. The proposed method balances the two by training the model to decide when a tool is needed.
3. **Improve Accuracy and Tool Usage Precision**: The proposed method is validated on six scientific benchmark datasets, where it improves answer accuracy by 28.18% and tool usage precision by 13.89% on average, surpassing state-of-the-art models such as GPT-4o and Claude-3.5.

### Method Overview

The proposed method includes two main components:

1. **World Knowledge Distillation (WKD)**:
   - Through supervised fine-tuning and preference learning, the pre-trained LLM internalizes highly accurate solutions generated by external tools.
   - The goal is for the LLM to generate solutions directly without relying on tools.
2. **Tool Usage Adaptation (TUA)**:
   - Questions are categorized as simple or complex based on the accuracy of the model's direct answers.
   - For simple questions, the alignment goal remains the same as in WKD; for complex questions, the model is trained to follow the traces of external tools, so that it learns to switch to tool usage when needed.
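The TUA partitioning step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Problem`, `direct_accuracy`, `partition_by_difficulty`, and `tua_targets`, the threshold of 0.5, and the four samples per question are all assumptions made for illustration, and the toy "model" is a dictionary lookup standing in for an LLM answering without tools.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Problem:
    question: str
    reference: str  # answer obtained with the external tool (assumed available)


def direct_accuracy(problem: Problem,
                    sample_answer: Callable[[str], str],
                    is_correct: Callable[[str, str], bool],
                    n_samples: int = 4) -> float:
    """Fraction of tool-free answers that match the reference."""
    hits = sum(
        is_correct(sample_answer(problem.question), problem.reference)
        for _ in range(n_samples)
    )
    return hits / n_samples


def partition_by_difficulty(problems: List[Problem],
                            sample_answer: Callable[[str], str],
                            is_correct: Callable[[str, str], bool],
                            threshold: float = 0.5,  # hypothetical cut-off
                            n_samples: int = 4) -> Tuple[List[Problem], List[Problem]]:
    """Split problems into (easy, hard) by direct answering accuracy."""
    easy, hard = [], []
    for p in problems:
        acc = direct_accuracy(p, sample_answer, is_correct, n_samples)
        (easy if acc >= threshold else hard).append(p)
    return easy, hard


def tua_targets(easy: List[Problem], hard: List[Problem]) -> List[Tuple[str, str]]:
    """Easy problems keep the direct-answer target; hard ones get a tool-use target."""
    return ([(p.question, "direct") for p in easy]
            + [(p.question, "tool") for p in hard])


# Toy stand-in for an LLM answering without tools.
def toy_model(question: str) -> str:
    known = {"2 + 2": "4"}
    return known.get(question, "unknown")


problems = [
    Problem("2 + 2", "4"),
    Problem("solve the heat equation on a rod", "u(x, t)"),
]
easy, hard = partition_by_difficulty(problems, toy_model, lambda a, r: a == r)
targets = tua_targets(easy, hard)
print(targets)
```

In this toy run the arithmetic question lands in the easy set and keeps a direct-answer target, while the question the stand-in model cannot answer is routed to a tool-use target. With a real model, `sample_answer` would draw stochastic samples, so the accuracy estimate and the choice of threshold would matter far more than they do here.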
### Experimental Results

- **Answer Accuracy**: On custom datasets, the proposed method significantly outperforms baseline models, and it also shows notable improvements on public datasets.
- **Tool Usage Accuracy**: The proposed method achieves the highest tool usage accuracy across all datasets, indicating that the model can intelligently decide when to use tools.

### Main Contributions

1. **Proposed a New Two-Component Training Paradigm** that enables LLMs to adaptively solve real-world scientific problems of varying complexity.
2. **Constructed Four New Datasets** covering multiple scientific domains, including mathematics, physics, climate science, and epidemiology, to facilitate future research.
3. **Experimental Results Demonstrate the Method's Effectiveness**, achieving better answer accuracy and smarter tool usage decisions across multiple datasets.

### Conclusion

Through World Knowledge Distillation and Tool Usage Adaptation, the proposed method reduces LLM hallucinations on complex scientific problems and improves accuracy and reliability without sacrificing basic reasoning capabilities.

Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation
