LaMSUM: Creating Extractive Summaries of User Generated Content using LLMs

Garima Chhikara, Anurag Sharma, V. Gurucharan, Kripabandhu Ghosh, Abhijnan Chakraborty
2024-08-23
Abstract: Large Language Models (LLMs) have demonstrated impressive performance across a wide range of NLP tasks, including summarization. LLMs inherently produce abstractive summaries by paraphrasing the original text, while the generation of extractive summaries - selecting specific subsets from the original text - remains largely unexplored. LLMs have a limited context window size, restricting the amount of data that can be processed at once. We tackle this challenge by introducing LaMSUM, a novel multi-level framework designed to generate extractive summaries from large collections of user-generated text using LLMs. LaMSUM integrates summarization with different voting methods to achieve robust summaries. Extensive evaluation using four popular LLMs (Llama 3, Mixtral, Gemini, GPT-4o) demonstrates that LaMSUM outperforms state-of-the-art extractive summarization methods. Overall, this work represents one of the first attempts to achieve extractive summarization by leveraging the power of LLMs, and is likely to spark further interest within the research community.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the problem of using large language models (LLMs) to generate extractive summaries. It focuses on two main challenges:

1. **Limitations of generating extractive summaries**:
   - LLMs tend to generate abstractive summaries: they rewrite the original text rather than selecting sentences from it verbatim.
   - Because their context window is limited, LLMs cannot process a large amount of text at once, which restricts their ability to handle long inputs.
2. **Handling large-scale user-generated content**:
   - Social media platforms produce large volumes of user-generated content (such as tweets and Facebook posts), which calls for effective summarization methods that extract the key information.
   - Extractive summarization is particularly important for user-generated content because it preserves users' original words and avoids distorting what they said.

### Solutions

To address these problems, the paper proposes LaMSUM, a multi-level framework that combines LLMs with voting algorithms to generate extractive summaries. The steps are as follows:

1. **Multi-level summary generation**:
   - Divide the input text into multiple chunks and have the LLM generate a summary for each chunk.
   - Merge the chunk summaries into new input chunks and summarize again, repeating until a summary of the required length is produced.
2. **Handling position bias**:
   - Generate multiple variants of each chunk by randomly shuffling the order of its sentences, which reduces the position bias of LLMs.
   - Use a voting algorithm (such as Plurality Voting, Proportional Approval Voting, or Borda Count) to select the best summary units across the variants.
3. **Application of voting algorithms**:
   - Treat the summaries generated for the variants as ballots in an election and use a voting algorithm to select the most appropriate summary units.
   - Different voting algorithms suit different scenarios. For example, Plurality Voting suits simple majority selection, while Borda Count suits rank-based selection.

### Experimental results

The paper evaluates the LaMSUM framework experimentally with four LLMs (Llama 3, Mixtral, Gemini, GPT-4o). The results show that LaMSUM outperforms existing extractive summarization methods on multiple datasets, especially on large-scale user-generated content.

### Summary

LaMSUM addresses the challenges of generating extractive summaries by combining LLMs with voting algorithms, providing an effective method for summarizing large-scale user-generated content. The work shows the potential of LLMs for summary generation and opens a new direction for future research.
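
The pipeline described under Solutions (chunking, shuffled variants, voting, and repeated merging) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `multi_level_summary`, `summarize_chunk`, `borda_count`, `vote_count`, and `longest_first` are hypothetical, as are the choices to halve each chunk per level and to use three shuffled variants. The `select` callable stands in for an LLM call that returns sentences verbatim in ranked order; here it is replaced by a toy stub so the example runs offline.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical stand-in for an LLM call: given sentences and k, return k of
# them verbatim, ordered from most to least important.
Selector = Callable[[Sequence[str], int], List[str]]
Vote = Callable[[List[List[str]], int], List[str]]


def borda_count(ballots: List[List[str]], k: int) -> List[str]:
    """Rank-based voting: position i on a ballot of length n earns n - i points."""
    scores: Counter = Counter()
    for ballot in ballots:
        n = len(ballot)
        for i, sentence in enumerate(ballot):
            scores[sentence] += n - i
    return [s for s, _ in scores.most_common(k)]


def vote_count(ballots: List[List[str]], k: int) -> List[str]:
    """Simplified approval-style tally: one vote per ballot a sentence appears on."""
    votes: Counter = Counter()
    for ballot in ballots:
        for sentence in dict.fromkeys(ballot):  # de-duplicate, keep order
            votes[sentence] += 1
    return [s for s, _ in votes.most_common(k)]


def summarize_chunk(chunk: Sequence[str], k: int, select: Selector, vote: Vote,
                    rng: random.Random, n_variants: int = 3) -> List[str]:
    """Query the selector on shuffled variants of one chunk, then vote."""
    ballots = []
    for _ in range(n_variants):
        variant = list(chunk)
        rng.shuffle(variant)  # shuffling is meant to counter position bias
        ballots.append(select(variant, min(k, len(variant))))
    return vote(ballots, k)


def multi_level_summary(sentences: Sequence[str], k: int, chunk_size: int,
                        select: Selector, vote: Vote = borda_count,
                        seed: int = 0) -> List[str]:
    """Shrink the input level by level until it fits in a single chunk.

    Assumptions of this sketch: each chunk is halved per level, and
    k <= chunk_size // 2 so that enough sentences survive to the last level.
    """
    if chunk_size < 2:
        raise ValueError("chunk_size must be at least 2 so each level shrinks")
    rng = random.Random(seed)
    per_chunk = max(1, chunk_size // 2)
    current = list(sentences)
    while len(current) > chunk_size:
        merged: List[str] = []
        for start in range(0, len(current), chunk_size):
            chunk = current[start:start + chunk_size]
            merged.extend(summarize_chunk(chunk, per_chunk, select, vote, rng))
        current = merged  # chunk summaries become the next level's input
    return summarize_chunk(current, k, select, vote, rng)


def longest_first(sents: Sequence[str], k: int) -> List[str]:
    """Toy selector used instead of an LLM: prefer longer sentences."""
    return sorted(sents, key=lambda s: (-len(s), s))[:k]


if __name__ == "__main__":
    posts = ["a", "bb", "ccc", "dddd", "eeeee", "ffffff", "ggggggg", "hhhhhhhh"]
    print(multi_level_summary(posts, k=2, chunk_size=4, select=longest_first))
    # prints ['hhhhhhhh', 'ggggggg']
```

Because every selected unit is copied from the input rather than generated, the output stays extractive by construction. Swapping `vote=vote_count` for the default `borda_count` changes only how the ballots are aggregated, which mirrors the way the framework treats the voting method as a pluggable component.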