What's Wrong? Refining Meeting Summaries with LLM Feedback

Frederic Kirstein, Terry Ruas, Bela Gipp
2024-07-17
Abstract: Meeting summarization has become a critical task since digital encounters have become a common practice. Large language models (LLMs) show great potential in summarization, offering enhanced coherence and context understanding compared to traditional methods. However, they still struggle to maintain relevance and avoid hallucination. We introduce a multi-LLM correction approach for meeting summarization using a two-phase process that mimics the human review process: mistake identification and summary refinement. We release QMSum Mistake, a dataset of 200 automatically generated meeting summaries annotated by humans on nine error types, including structural, omission, and irrelevance errors. Our experiments show that these errors can be identified with high accuracy by an LLM. We transform identified mistakes into actionable feedback to improve the quality of a given summary measured by relevance, informativeness, conciseness, and coherence. This post-hoc refinement effectively improves summary quality by leveraging multiple LLMs to validate output quality. Our multi-LLM approach for meeting summarization shows potential for similar complex text generation tasks requiring robustness, action planning, and discussion towards a goal.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks how large language models (LLMs) can be used to improve the quality of meeting summaries. The authors note that although LLM-generated meeting summaries are coherent and show good contextual understanding, they still fall short on relevance and on avoiding hallucination (i.e., generating untrue content). To address this, the authors propose a multi-LLM correction method that imitates the human review process in two stages: error identification and summary refinement.

### Specific Problems and Solutions

1. **Error Identification**:
   - The authors constructed a dataset named QMSum Mistake, which contains 200 automatically generated meeting summaries manually annotated with nine error types, including structural errors, omissions, and irrelevant information.
   - They use LLMs such as GPT-4 to identify these errors. Experiments show that GPT-4 identifies most error types with high accuracy, but performs slightly worse on irrelevant information and hallucination errors.
2. **Summary Refinement**:
   - After identifying the errors, the authors convert them into specific feedback for improving the summary. They explore different feedback protocols and transmission protocols to determine the best refinement strategy.
   - The experimental results show that feedback combining chain-of-thought (CoT) explanations with correction suggestions can significantly improve the relevance, informativeness, conciseness, and coherence of the summary.

### Main Contributions

- Constructed and released the QMSum Mistake dataset, which contains 200 meeting summaries and their manually annotated errors.
- Proposed a multi-LLM method that identifies errors in meeting summaries and corrects them using multiple prompting strategies.
- Converted the identified errors into actionable feedback, forming a complete set of summary refinement protocols that significantly improves summary quality.

### Summary

This paper addresses the relevance and hallucination problems in LLM-generated meeting summaries by introducing a multi-LLM correction method, offering a new approach to improving the quality of automatic summaries.
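
The two-stage process described above (error identification, then feedback-driven refinement) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable, the prompt wording, the three-line reply format, the choice to query one error type per call, and the helper names `parse_fields`, `identify_mistakes`, and `refine_summary` are all hypothetical. Only the four error types named in this summary are listed; the dataset annotates nine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model interface: takes a prompt, returns the completion text.
LLM = Callable[[str], str]

# Error types named in the text above. The dataset covers nine types in total;
# the others are not listed in this summary and are deliberately left out.
ERROR_TYPES = ["structural", "omission", "irrelevance", "hallucination"]


@dataclass
class Feedback:
    error_type: str
    explanation: str  # chain-of-thought style reasoning about the mistake
    suggestion: str   # concrete correction suggestion


def parse_fields(reply: str) -> Dict[str, str]:
    """Parse 'KEY: value' lines from a model reply (assumed reply format)."""
    fields: Dict[str, str] = {}
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().upper()] = value.strip()
    return fields


def identify_mistakes(transcript: str, summary: str, reviewer: LLM) -> List[Feedback]:
    """Stage 1: ask a reviewer model about each error type in turn."""
    found: List[Feedback] = []
    for error_type in ERROR_TYPES:
        prompt = (
            f"Transcript:\n{transcript}\n\nSummary:\n{summary}\n\n"
            f"Does the summary contain an error of type '{error_type}'?\n"
            "Reply on three lines:\nVERDICT: yes or no\n"
            "EXPLANATION: your reasoning\nSUGGESTION: how to fix it"
        )
        fields = parse_fields(reviewer(prompt))
        if fields.get("VERDICT", "").lower().startswith("yes"):
            found.append(
                Feedback(
                    error_type,
                    fields.get("EXPLANATION", ""),
                    fields.get("SUGGESTION", ""),
                )
            )
    return found


def refine_summary(transcript: str, summary: str,
                   feedback: List[Feedback], writer: LLM) -> str:
    """Stage 2: pass explanations plus suggestions to a second model."""
    if not feedback:
        return summary  # nothing flagged, keep the summary unchanged
    notes = "\n".join(
        f"- [{f.error_type}] {f.explanation} Fix: {f.suggestion}" for f in feedback
    )
    prompt = (
        f"Transcript:\n{transcript}\n\nSummary:\n{summary}\n\n"
        f"Reviewer feedback:\n{notes}\n\n"
        "Rewrite the summary so that it addresses every feedback item."
    )
    return writer(prompt).strip()
```

Keeping the reviewer and the writer as separate callables mirrors the multi-LLM idea: either one can be swapped for a different model, or for a stub in tests, without touching the pipeline logic.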