Abstract: Recent works have shown that large language model (LLM) agents are able to improve themselves from experience, which is an important ability for continuous enhancement post-deployment. However, existing benchmarks primarily evaluate their innate capabilities and do not assess their ability to improve over time. To address this gap, we introduce StreamBench, a pioneering benchmark designed to evaluate the continuous improvement of LLM agents over an input-feedback sequence. StreamBench simulates an online learning environment where LLMs receive a continuous flow of feedback stream and iteratively enhance their performance. In addition, we propose several simple yet effective baselines for improving LLMs on StreamBench, and provide a comprehensive analysis to identify critical components that contribute to successful streaming strategies. Our work serves as a stepping stone towards developing effective online learning strategies for LLMs, paving the way for more adaptive AI systems in streaming scenarios. Source code: https://github.com/stream-bench/stream-bench. Benchmark website: https://stream-bench.github.io.
What problem does this paper attempt to address?
Current evaluations of large language model (LLM) agents focus on their innate capabilities and do not measure how well they improve themselves from continuous feedback. Existing benchmarks assess an agent's static performance before deployment, not its ability to get better through experience during actual use. This leaves an evaluation gap: there is no standard way to measure how much an LLM agent improves as it receives feedback after deployment.
To fill this gap, the paper introduces **StreamBench**, a benchmark designed to evaluate the continuous improvement of LLM agents over an input-feedback sequence. StreamBench simulates an online learning environment in which LLM agents receive a continuous stream of feedback and iteratively improve their performance. The paper also proposes several simple but effective baseline methods for improving LLM agents on StreamBench, and analyzes which components contribute most to a successful streaming strategy.
### Main contributions
1. **Introduction of StreamBench**: The first benchmark designed to evaluate the continuous improvement of LLM agents over input-feedback sequences, covering a wide range of task types.
2. **Baseline methods**: Several simple but effective baselines that improve LLM agents in streaming scenarios, including a cost-effective multi-agent method that outperforms the other baselines while keeping the average cost at that of a single agent.
3. **Analysis of advantages and potential problems**: An analysis of the strengths and potential pitfalls of the proposed baselines, offering insight into what makes LLM streaming strategies effective.
### Setting of streaming scenarios
- **Agent**: An LLM \( f(\cdot|\theta) \) with parameters \( \theta \), optionally augmented with components such as an external memory \( M \) and a retriever \( r(\cdot) \) for storing and retrieving useful information. Given a natural-language instance \( x \) and a prompt template \( p(\cdot) \), the agent's output is \( \hat{y}=f(p(x, r(M))|\theta) \).
- **Environment**: The external environment \( g(\cdot) \) provides feedback signals, the form of which depends on the specific downstream task and the type of feedback collected.
- **Input-feedback sequence**: Inputs arrive as a stream, where \( x_t \) denotes the input at the \( t \)-th time step. After the agent produces its output \( \hat{y}_t \), the environment returns a feedback signal \( fb_t = g(x_t,\hat{y}_t) \).
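The interaction between agent, environment, and stream described above can be sketched as a simple loop. This is a minimal illustration, not the benchmark's actual code: the class `StreamingAgent`, the function `run_stream`, the recency-based retriever, the prompt layout, and the rule of storing only outputs that received positive feedback are all assumptions made for the example, and `llm` is a hypothetical stand-in for \( f(\cdot|\theta) \).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class StreamingAgent:
    """Toy agent computing y_hat = f(p(x, r(M)) | theta)."""

    llm: Callable[[str], str]  # hypothetical stand-in for f(. | theta)
    memory: List[Tuple[str, str]] = field(default_factory=list)  # external memory M
    top_k: int = 2  # number of stored examples placed in the prompt

    def retrieve(self) -> List[Tuple[str, str]]:
        # r(M): assumed here to return the k most recent entries.
        return self.memory[-self.top_k:]

    def prompt(self, x: str, examples: Sequence[Tuple[str, str]]) -> str:
        # p(x, r(M)): retrieved pairs are prepended as in-context examples.
        shots = "".join(f"Q: {q}\nA: {a}\n" for q, a in examples)
        return f"{shots}Q: {x}\nA:"

    def act(self, x: str) -> str:
        return self.llm(self.prompt(x, self.retrieve()))

    def update(self, x: str, y_hat: str, fb: bool) -> None:
        # Assumed update rule: keep only outputs the environment marked correct.
        if fb:
            self.memory.append((x, y_hat))


def run_stream(
    agent: StreamingAgent,
    inputs: Sequence[str],
    env: Callable[[str, str], bool],  # g(x_t, y_hat_t) -> fb_t
) -> List[bool]:
    """Process inputs one time step at a time and return the feedback sequence."""
    feedback = []
    for x_t in inputs:
        y_hat_t = agent.act(x_t)
        fb_t = env(x_t, y_hat_t)
        agent.update(x_t, y_hat_t, fb_t)
        feedback.append(fb_t)
    return feedback
```

In this sketch the feedback is a binary correctness signal; a task with richer feedback would change the type returned by `env` and the logic in `update`.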
### Evaluation metrics
- **Final performance**: In practice, the agent's goal is to satisfy as many user needs as possible over the course of the stream. Performance is therefore measured by the metric aggregated over all time steps up to the final step \( T \). For example, the final metric for a given dataset can be calculated as:
\[
\frac{\sum_{t = 1}^{T}h(\hat{y}_t,y_t)}{T}
\]
where \( h \) is a function for calculating the corresponding metric for a given dataset.
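As a worked illustration of the formula above, the following sketch averages a per-instance metric \( h \) over the whole stream. The function names `final_performance` and `exact_match` are hypothetical, and exact match is only one possible choice of \( h \); the appropriate metric depends on the dataset.

```python
from typing import Callable, Sequence


def final_performance(
    preds: Sequence[str],
    refs: Sequence[str],
    h: Callable[[str, str], float],
) -> float:
    """Return (1/T) * sum_{t=1..T} h(y_hat_t, y_t) over the whole stream."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    if not preds:
        raise ValueError("the stream must contain at least one time step")
    return sum(h(y_hat, y) for y_hat, y in zip(preds, refs)) / len(preds)


def exact_match(y_hat: str, y: str) -> float:
    # One possible h: 1.0 if the output equals the reference, else 0.0.
    return float(y_hat == y)


# Three of four outputs match their references, so the score is 0.75.
score = final_performance(["a", "b", "c", "d"], ["a", "x", "c", "d"], exact_match)
```

Because every time step counts equally, mistakes made early in the stream, before the agent has accumulated useful feedback, still lower the final score.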
### Experimental setup
- **Datasets**: Several downstream tasks with potential real-world applications are selected, including text-to-SQL conversion, Python programming, tool use, medical diagnosis, and question answering.
- **Baseline methods**: Non-streaming methods (zero-shot, few-shot, chain-of-thought, and Self-Refine) and streaming methods (GrowPrompt, MemPrompt, Self-StreamICL, and the multi-agent MAM-StreamICL).
### Results and discussion
- **Main results**: Streaming methods significantly outperform non-streaming methods on most datasets. Even when the feedback is only a simple correctness signal, the agent can improve further by reusing its own self-generated outputs.
- **Key insights**: Collecting and using correct self-generated outputs is crucial for streaming improvement; sharing among multiple agents...