Large Language Models Can Self-Improve in Long-context Reasoning

Siheng Li, Cheng Yang, Zesen Cheng, Lemao Liu, Mo Yu, Yujiu Yang, Wai Lam
2024-11-13
Abstract: Large language models (LLMs) have achieved substantial progress in processing long contexts but still struggle with long-context reasoning. Existing approaches typically involve fine-tuning LLMs with synthetic data, which depends on annotations from human experts or advanced models like GPT-4, thus restricting further advancements. To address this issue, we investigate the potential for LLMs to self-improve in long-context reasoning and propose SEALONG, an approach specifically designed for this purpose. This approach is straightforward: we sample multiple outputs for each question, score them with Minimum Bayes Risk, and then apply supervised fine-tuning or preference optimization based on these outputs. Extensive experiments on several leading LLMs demonstrate the effectiveness of SEALONG, with an absolute improvement of $4.2$ points for Llama-3.1-8B-Instruct. Furthermore, SEALONG achieves superior performance compared to prior approaches that depend on data produced by human experts or advanced models. We anticipate that this work will open new avenues for self-improvement techniques in long-context scenarios, which are essential for the continual advancement of LLMs.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the weakness of large language models (LLMs) in long-context reasoning. Although existing LLMs have made significant progress in processing long contexts, they still perform poorly on tasks that require reasoning across multiple paragraphs. To overcome this limitation, the paper proposes a method called SEALONG, which lets an LLM improve its own long-context reasoning. The method has three steps:

1. **Sampling multiple reasoning paths**: For each question and its corresponding long context, a plan-and-solve prompting strategy is used to sample multiple reasoning paths from the LLM.
2. **Scoring mechanism**: The outputs are scored using Minimum Bayes Risk (MBR), which favors reasoning paths that are consistent with the majority of the sampled outputs.
3. **Supervised fine-tuning or preference optimization**: High-scoring outputs can be used for supervised fine-tuning, or high-scoring and low-scoring outputs can be paired for preference optimization.

Through these steps, SEALONG improves LLM performance on long-context reasoning tasks without annotations from human experts or advanced models. Experimental results show that SEALONG achieves significant performance improvements across multiple LLMs, particularly on multi-document question-answering tasks.
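
The scoring and data-construction steps (2 and 3) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the word-overlap `jaccard` similarity is a stand-in chosen only to keep the example dependency-free, and the function names `mbr_scores` and `pick_chosen_and_rejected` are hypothetical. The sampled outputs are assumed to be already available as plain strings.

```python
from typing import Callable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; an assumed stand-in similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mbr_scores(
    outputs: List[str],
    sim: Callable[[str, str], float] = jaccard,
) -> List[float]:
    """Score each output by its average similarity to all other sampled outputs.

    Outputs that agree with the majority of samples receive higher scores.
    """
    n = len(outputs)
    if n < 2:
        return [0.0] * n
    scores = []
    for i, candidate in enumerate(outputs):
        total = sum(sim(candidate, other) for j, other in enumerate(outputs) if j != i)
        scores.append(total / (n - 1))
    return scores


def pick_chosen_and_rejected(outputs: List[str]) -> Tuple[str, str]:
    """Return (highest-scoring, lowest-scoring) outputs.

    The first could serve as a supervised fine-tuning target; the pair could
    serve as a (chosen, rejected) example for preference optimization.
    """
    scores = mbr_scores(outputs)
    best = max(range(len(outputs)), key=scores.__getitem__)
    worst = min(range(len(outputs)), key=scores.__getitem__)
    return outputs[best], outputs[worst]


# Hypothetical sampled answers for one question.
samples = [
    "the answer is paris",
    "the answer is paris",
    "the answer is london",
]
chosen, rejected = pick_chosen_and_rejected(samples)
```

In this toy case the two agreeing samples score higher than the outlier, so one of them becomes the chosen output and the outlier becomes the rejected one. A real pipeline would likely swap in a semantic similarity measure in place of word overlap.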