Abstract: LLMs can now act as autonomous agents that interact with digital environments and complete specific objectives (e.g., arranging an online meeting). However, accuracy is still far from satisfactory, partly due to a lack of large-scale, direct demonstrations for digital tasks. Obtaining supervised data from humans is costly, and automatic data collection through exploration or reinforcement learning relies on complex environmental and content setup, resulting in datasets that lack comprehensive coverage of various scenarios. On the other hand, there is abundant knowledge that may indirectly assist task completion, such as online tutorials that were created for human consumption. In this work, we present Synatra, an approach that effectively transforms this indirect knowledge into direct supervision at scale. We define different types of indirect knowledge, and carefully study the available sources to obtain it, methods to encode the structure of direct demonstrations, and finally methods to transform indirect knowledge into direct demonstrations. We use 100k such synthetically-created demonstrations to finetune a 7B CodeLlama, and demonstrate that the resulting agent surpasses all comparably sized models on three web-based task benchmarks (Mind2Web, MiniWoB++ and WebArena), as well as surpassing GPT-3.5 on WebArena and Mind2Web. In addition, while synthetic demonstrations prove to be only 3% of the cost of human demonstrations (at $0.031 each), we show that the synthetic demonstrations can be more effective than an identical number of human demonstrations collected from limited domains.

### What problem does this paper attempt to solve?

This paper addresses the insufficient accuracy of current large language models (LLMs) when they act as agents that carry out tasks in digital environments. Specifically, LLM agents are not yet reliable at achieving specific goals (such as scheduling an online meeting), partly because large-scale, direct task demonstration data is lacking. Obtaining human-annotated supervision data is costly, and automatically collecting data through exploration or reinforcement learning depends on complex environment and content setup, so the resulting datasets do not comprehensively cover the range of scenarios.

To solve these problems, the paper proposes a method named **Synatra**, which transforms indirect knowledge (such as online tutorials written for humans) into large-scale direct demonstration data. With this method, researchers can draw on abundant existing indirect knowledge instead of relying on costly human annotation or limited LLM trajectory data, thereby improving the performance of AI agents on digital tasks.

### Main contributions of Synatra

1. **Transforming indirect knowledge into direct demonstrations**: Synatra defines different types of indirect knowledge, studies the sources from which it can be obtained and how to encode the structure of direct demonstrations, and finally transforms indirect knowledge into direct demonstrations.
2. **Generating high-quality synthetic trajectories**: Synatra generates 100,000 synthetic task trajectories covering 21 domains, which are then used to fine-tune a 7B-parameter CodeLlama model.
3. **Outperforming existing models**: Experimental results show that Synatra-CodeLlama outperforms all comparably sized models on three web-based task benchmarks (Mind2Web, MiniWoB++ and WebArena), and also outperforms larger models such as GPT-3.5 and Lemur-chat-70b on some tasks.
4. **Low cost and high efficiency**: Each synthetic example costs only about 3% as much as a human-annotated example, yet the synthetic data can be more effective than the same number of human demonstrations collected from limited domains.

### Summary

Synatra provides an efficient, low-cost way to improve AI agents on digital tasks by transforming indirect knowledge into direct demonstrations. This improves model accuracy while substantially reducing the cost of data acquisition, and it suggests a direction for future agent development.
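To put the cost figures from the abstract in perspective, the short sketch below works through the arithmetic. It uses only the numbers stated there ($0.031 per synthetic demonstration, roughly 3% of the human cost, 100k demonstrations). The implied human cost is an estimate derived from the approximate 3% ratio, not a figure reported in the abstract, and the variable and function names are illustrative.

```python
from decimal import Decimal

# Figures quoted in the abstract.
SYNTHETIC_COST_USD = Decimal("0.031")  # cost per synthetic demonstration
COST_RATIO = Decimal("0.03")           # synthetic cost is ~3% of human cost
NUM_DEMOS = 100_000                    # demonstrations used for fine-tuning


def implied_human_cost(synthetic_cost: Decimal, ratio: Decimal) -> Decimal:
    """Back out an approximate per-demonstration human cost from the ratio.

    Assumption: the 3% ratio is treated as exact here, although the
    abstract states it only as a rounded figure.
    """
    return synthetic_cost / ratio


synthetic_total = SYNTHETIC_COST_USD * NUM_DEMOS
human_each = implied_human_cost(SYNTHETIC_COST_USD, COST_RATIO)
human_total = human_each * NUM_DEMOS

print(f"Synthetic total: ${synthetic_total:,.2f}")
print(f"Implied human cost per demonstration: ${human_each:.2f}")
print(f"Implied human total: ${human_total:,.2f}")
```

Under these assumptions the full synthetic set costs about $3,100, while the same number of human demonstrations would cost on the order of $100,000.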

Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale
