Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models

Cheng-Yu Hsieh, Si-An Chen, Chun-Liang Li, Yasuhisa Fujii, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, Tomas Pfister
2023-08-02
Abstract: Today, large language models (LLMs) are taught to use new tools by providing a few demonstrations of the tool's usage. Unfortunately, demonstrations are hard to acquire, and can result in undesirable biased usage if the wrong demonstration is chosen. Even in the rare scenario that demonstrations are readily available, there is no principled selection protocol to determine how many and which ones to provide. As tasks grow more complex, the selection search grows combinatorially and invariably becomes intractable. Our work provides an alternative to demonstrations: tool documentation. We advocate the use of tool documentation, descriptions for the individual tool usage, over demonstrations. We substantiate our claim through three main empirical findings on 6 tasks across both vision and language modalities. First, on existing benchmarks, zero-shot prompts with only tool documentation are sufficient for eliciting proper tool usage, achieving performance on par with few-shot prompts. Second, on a newly collected realistic tool-use dataset with hundreds of available tool APIs, we show that tool documentation is significantly more valuable than demonstrations, with zero-shot documentation significantly outperforming few-shot without documentation. Third, we highlight the benefits of tool documentations by tackling image generation and video tracking using just-released unseen state-of-the-art models as tools. Finally, we highlight the possibility of using tool documentation to automatically enable new applications: by using nothing more than the documentation of GroundingDino, Stable Diffusion, XMem, and SAM, LLMs can re-invent the functionalities of the just-released Grounded-SAM and Track Anything models.
Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper asks whether large language models (LLMs) can learn to use new tools from tool documentation (docs) rather than from example demonstrations (demos). Today, LLMs are typically taught a new tool by placing a few demonstrations of its usage in the prompt, but this approach has several problems: high-quality demonstrations are hard to obtain, and a poorly chosen demonstration can bias the model's tool usage or degrade its performance. As tasks grow more complex, deciding how many demonstrations to provide, and which ones, becomes a combinatorial search that quickly turns intractable. The paper therefore proposes prompting the LLM with tool documentation, that is, descriptions of how each individual tool is used, which enables zero-shot tool use. This reduces the dependence on demonstrations and makes it easier to scale an LLM to a large number of tools. Experiments on 6 tasks across vision and language modalities show that, on existing benchmarks, zero-shot prompts containing only tool documentation perform on par with few-shot prompts, and that on a newly collected dataset with hundreds of tool APIs, zero-shot prompts with documentation significantly outperform few-shot prompts without it. The paper also shows that, simply by reading documentation, LLMs can put newly released vision models to work on tasks such as image editing and video tracking, and can even recombine those models to reproduce the functionality of Grounded-SAM and Track Anything.
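The contrast between the two prompting styles can be made concrete with a minimal sketch. It is not the paper's actual prompt format: the `ToolDoc` container, the two builder functions, and the tool names `detect_objects` and `segment_region` are all hypothetical and chosen only for illustration. The sketch shows that a documentation prompt describes each tool once and contains no solved examples, whereas a demonstration prompt contains solved examples and no tool descriptions.

```python
from dataclasses import dataclass


@dataclass
class ToolDoc:
    """Hypothetical container for one tool's documentation."""
    name: str
    description: str
    usage: str


def build_doc_prompt(tools, question):
    """Zero-shot prompt: tool documentation only, no demonstrations."""
    lines = ["You can call the following tools:"]
    for tool in tools:
        lines.append(f"- {tool.name}: {tool.description}")
        lines.append(f"  Usage: {tool.usage}")
    lines.append("")
    lines.append(f"Question: {question}")
    lines.append("Plan:")
    return "\n".join(lines)


def build_demo_prompt(demos, question):
    """Few-shot prompt: solved demonstrations only, no documentation."""
    lines = []
    for demo_question, demo_plan in demos:
        lines.append(f"Question: {demo_question}")
        lines.append(f"Plan: {demo_plan}")
        lines.append("")
    lines.append(f"Question: {question}")
    lines.append("Plan:")
    return "\n".join(lines)


# Illustrative (made-up) tools and demonstrations.
tools = [
    ToolDoc(
        name="detect_objects",
        description="Finds objects in an image that match a text query.",
        usage="detect_objects(image, query) -> list of bounding boxes",
    ),
    ToolDoc(
        name="segment_region",
        description="Produces a pixel mask for a bounding box.",
        usage="segment_region(image, box) -> mask",
    ),
]
demos = [
    ("Where is the dog?", "detect_objects(image, 'dog')"),
    ("Outline the red car.", "segment_region(image, detect_objects(image, 'red car')[0])"),
]

question = "Cut out the cat from the photo."
doc_prompt = build_doc_prompt(tools, question)
demo_prompt = build_demo_prompt(demos, question)

print(doc_prompt)
print("-----")
print(demo_prompt)
```

In this sketch, adding a new tool to the documentation prompt means appending one `ToolDoc` entry, while the demonstration prompt would need new solved examples that exercise the tool, which is the selection problem the paper argues becomes intractable as tasks grow.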