GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning

Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, Dawn Song, Bo Li
2024-06-13
Abstract: The rapid advancement of large language models (LLMs) has catalyzed the deployment of LLM-powered agents across numerous applications, raising new concerns regarding their safety and trustworthiness. Existing methods for enhancing the safety of LLMs are not directly transferable to LLM-powered agents due to their diverse objectives and output modalities. In this paper, we propose GuardAgent, the first LLM agent as a guardrail to other LLM agents. Specifically, GuardAgent oversees a target LLM agent by checking whether its inputs/outputs satisfy a set of given guard requests defined by the users. GuardAgent comprises two steps: 1) creating a task plan by analyzing the provided guard requests, and 2) generating guardrail code based on the task plan and executing the code by calling APIs or using external engines. In both steps, an LLM is utilized as the core reasoning component, supplemented by in-context demonstrations retrieved from a memory module. Such knowledge-enabled reasoning allows GuardAgent to understand various textual guard requests and accurately "translate" them into executable code that provides reliable guardrails. Furthermore, GuardAgent is equipped with an extendable toolbox containing functions and APIs and requires no additional LLM training, which underscores its generalization capabilities and low operational overhead. Additionally, we propose two novel benchmarks: an EICU-AC benchmark for assessing privacy-related access control for healthcare agents and a Mind2Web-SC benchmark for safety evaluation for web agents. We show the effectiveness of GuardAgent on these two benchmarks with 98.7% and 90.0% accuracy in moderating invalid inputs and outputs for the two types of agents, respectively. We also show that GuardAgent is able to define novel functions in adaptation to emergent LLM agents and guard requests, which underscores its strong generalization capabilities.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the safety and trustworthiness of agents powered by large language models (LLMs). Existing methods for making LLMs safer cannot be applied directly to LLM-powered agents, because these agents have diverse objectives and output modalities. The paper therefore proposes GuardAgent, the first LLM agent that acts as a guardrail for other LLM agents. GuardAgent monitors the inputs and outputs of a target agent and checks whether they satisfy a set of user-defined guard requests (such as safety rules or privacy policies). It works in two steps: it first creates a task plan from the guard requests, then generates guardrail code from that plan and executes it by calling APIs or external engines. The paper also introduces two new benchmarks: EICU-AC, for evaluating privacy-related access control in healthcare agents, and Mind2Web-SC, for evaluating the safety of web agents. GuardAgent reaches 98.7% and 90.0% accuracy, respectively, in moderating invalid inputs and outputs on these two benchmarks. Its ability to define new functions for emergent agents and guard requests, without additional LLM training, supports the claim of strong generalization.
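The two-step pipeline described in the abstract (plan, then generate and execute guardrail code, with demonstrations retrieved from memory) can be illustrated with a minimal sketch. This is not the authors' implementation. Every name here (`guard`, `retrieve`, `check_access`, `stub_llm`, `TOOLBOX`) is hypothetical, the LLM is replaced by a stub that returns canned text, and retrieval is a simple word-overlap ranking chosen only to keep the example self-contained.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical toolbox function that generated guardrail code is allowed to call.
def check_access(allowed: Dict[str, Set[str]], role: str, needed: Set[str]) -> Set[str]:
    """Return the databases in `needed` that `role` is not permitted to access."""
    return needed - allowed.get(role, set())

TOOLBOX = {"check_access": check_access}

def retrieve(memory: List[Tuple[str, str]], query: str, k: int = 1) -> List[Tuple[str, str]]:
    """Rank stored (request, demonstration) pairs by word overlap with the query.
    Word overlap is an assumption of this sketch, not the paper's retrieval method."""
    words = set(query.lower().split())
    ranked = sorted(
        memory,
        key=lambda demo: len(words & set(demo[0].lower().split())),
        reverse=True,
    )
    return ranked[:k]

def guard(
    llm: Callable[[str], str],
    memory: List[Tuple[str, str]],
    guard_request: str,
    agent_input: str,
    agent_output: str,
) -> Tuple[bool, Set[str]]:
    """Return (allowed, denied_items) for one input/output pair of the target agent."""
    demos = retrieve(memory, guard_request)
    # Step 1: ask the LLM for a task plan, given the guard request and demonstrations.
    plan = llm(
        f"PLAN\ndemos={demos}\nrequest={guard_request}\n"
        f"input={agent_input}\noutput={agent_output}"
    )
    # Step 2: ask the LLM for guardrail code that follows the plan, then execute it.
    code = llm(f"CODE\ndemos={demos}\nplan={plan}")
    scope = dict(TOOLBOX)  # generated code sees only the toolbox (plus builtins)
    exec(code, scope)      # a real system would sandbox this call
    denied = scope["denied"]
    return (not denied, denied)

def stub_llm(prompt: str) -> str:
    """Stand-in for a real LLM: returns canned text so the sketch runs offline."""
    if prompt.startswith("PLAN"):
        return "1. Identify the role. 2. List databases needed. 3. Compare with permissions."
    return 'denied = check_access({"nurse": {"vitals"}}, "nurse", {"vitals", "billing"})'

memory = [
    ("nurses may only read vitals", "example plan and code for access control"),
    ("block purchases by minors", "example plan and code for a web safety rule"),
]
result = guard(
    stub_llm,
    memory,
    "nurses may only read the vitals database",
    "role=nurse; question about billing",
    "SELECT * FROM billing",
)
print(result)  # (False, {'billing'})
```

The design point the sketch tries to capture is that the final decision comes from executing code rather than from the LLM's free-text judgment, which is what the abstract credits for reliable guardrails. Adding a new capability means adding an entry to the toolbox, with no model retraining.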