Abstract: Meeting summarization has become a critical task since digital encounters have become a common practice. Large language models (LLMs) show great potential in summarization, offering enhanced coherence and context understanding compared to traditional methods. However, they still struggle to maintain relevance and avoid hallucination. We introduce a multi-LLM correction approach for meeting summarization using a two-phase process that mimics the human review process: mistake identification and summary refinement. We release QMSum Mistake, a dataset of 200 automatically generated meeting summaries annotated by humans on nine error types, including structural, omission, and irrelevance errors. Our experiments show that these errors can be identified with high accuracy by an LLM. We transform identified mistakes into actionable feedback to improve the quality of a given summary measured by relevance, informativeness, conciseness, and coherence. This post-hoc refinement effectively improves summary quality by leveraging multiple LLMs to validate output quality. Our multi-LLM approach for meeting summarization shows potential for similar complex text generation tasks requiring robustness, action planning, and discussion towards a goal.

What problem does this paper attempt to address?

This paper addresses how to use large language models (LLMs) to improve the quality of meeting summaries. Specifically, the authors note that although LLM-generated meeting summaries show good coherence and context understanding, they remain deficient in relevance and in avoiding hallucinations (i.e., generating untrue content). To address these problems, the authors propose a multi-LLM correction method that imitates the human review process in two stages: error identification and summary optimization.

### Specific Problems and Solutions

1. **Error Identification**:
   - The authors constructed a dataset named QMSum Mistake, which contains 200 automatically generated meeting summaries manually annotated with nine types of errors, such as structural errors, missing information, and irrelevant information.
   - Large language models such as GPT-4 are used to identify these errors. Experiments show that GPT-4 identifies most error types with high accuracy, but performs slightly worse on irrelevant information and hallucination errors.
2. **Summary Optimization**:
   - After identifying the errors, the authors convert them into specific feedback to improve the quality of the summary. They explore different feedback protocols and transmission protocols to determine the best improvement strategy.
   - The experimental results show that feedback combining chain-of-thought (CoT) explanations with correction suggestions can significantly improve the relevance, informativeness, conciseness, and coherence of the summary.

### Main Contributions

- Constructed and released the QMSum Mistake dataset, which contains 200 meeting summaries and their manually annotated errors.
- Proposed a multi-LLM method to identify errors in meeting summaries and improve them through multiple prompting strategies.
- Converted the identified errors into actionable feedback, forming a complete set of summary optimization protocols that significantly improves the quality of the summary.

### Summary

By introducing the multi-LLM correction method, this paper addresses the relevance and hallucination problems in LLM-generated meeting summaries and provides new ideas and methods for improving the quality of automatic summaries.
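
The two-stage process described above (error identification, then feedback-driven refinement) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The prompt wording, the `LLM` callable type, and the names `Mistake`, `identify_mistakes`, `build_feedback`, and `refine_summary` are all hypothetical, and only the four error types named in the text are listed (the dataset annotates nine). Stub functions stand in for the identifier and refiner models so the example runs without any API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical model interface: prompt string in, completion string out.
LLM = Callable[[str], str]

# Only the error types named in the text; the dataset annotates nine in total.
ERROR_TYPES = ["structural", "omission", "irrelevance", "hallucination"]


@dataclass
class Mistake:
    error_type: str
    explanation: str  # rationale returned by the identifier model


def identify_mistakes(transcript: str, summary: str, identifier: LLM) -> List[Mistake]:
    """Stage 1: ask the identifier model about each error type separately."""
    mistakes = []
    for error_type in ERROR_TYPES:
        prompt = (
            f"Transcript:\n{transcript}\n\nSummary:\n{summary}\n\n"
            f"Does the summary contain an error of type '{error_type}'? "
            "Answer 'yes: <reason>' or 'no'."
        )
        answer = identifier(prompt).strip()
        if answer.lower().startswith("yes"):
            # Keep everything after the first colon as the explanation.
            explanation = answer.partition(":")[2].strip()
            mistakes.append(Mistake(error_type, explanation))
    return mistakes


def build_feedback(mistakes: List[Mistake]) -> str:
    """Turn identified mistakes into a bullet list of actionable feedback."""
    return "\n".join(f"- {m.error_type}: {m.explanation}" for m in mistakes)


def refine_summary(transcript: str, summary: str,
                   mistakes: List[Mistake], refiner: LLM) -> str:
    """Stage 2: pass the feedback to a second model that rewrites the summary."""
    if not mistakes:
        return summary  # nothing to correct
    prompt = (
        f"Transcript:\n{transcript}\n\nDraft summary:\n{summary}\n\n"
        f"Reviewer feedback:\n{build_feedback(mistakes)}\n\n"
        "Rewrite the summary so that it resolves every feedback item."
    )
    return refiner(prompt)


# Stub models used only for demonstration; a real setup would call an LLM API.
def stub_identifier(prompt: str) -> str:
    if "'omission'" in prompt:
        return "yes: the budget decision is missing"
    return "no"


def stub_refiner(prompt: str) -> str:
    return "refined summary"


if __name__ == "__main__":
    found = identify_mistakes("meeting transcript", "draft summary", stub_identifier)
    print(build_feedback(found))
    print(refine_summary("meeting transcript", "draft summary", found, stub_refiner))
```

Keeping the identifier and the refiner as separate callables mirrors the multi-LLM idea: a different model can fill each role, and the feedback format can be changed without touching either stage.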

What's Wrong? Refining Meeting Summaries with LLM Feedback

Summaries, Highlights, and Action items: Design, implementation and evaluation of an LLM-powered meeting recap system

Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator

Tell me what I need to know: Exploring LLM-based (Personalized) Abstractive Multi-Source Meeting Summarization

What's under the hood: Investigating Automatic Metrics on Meeting Summarization

Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends

Learning to Summarize from LLM-generated Feedback

Leveraging the Power of LLMs: A Fine-Tuning Approach for High-Quality Aspect-Based Summarization

LLMs as Evaluators: A Novel Approach to Evaluate Bug Report Summarization

Improving Faithfulness of Large Language Models in Summarization via Sliding Generation and Self-Consistency

Towards a Robust Retrieval-Based Summarization System

CREAM: Comparison-Based Reference-Free ELO-Ranked Automatic Evaluation for Meeting Summarization

Prompting and Fine-Tuning of Small LLMs for Length-Controllable Telephone Call Summarization

On Learning to Summarize with Large Language Models as References

Query-OPT: Optimizing Inference of Large Language Models via Multi-Query Instructions in Meeting Summarization

Assessing LLMs for Zero-shot Abstractive Summarization Through the Lens of Relevance Paraphrasing

A Framework to Assess Clinical Safety and Hallucination Rates of LLMs for Medical Text Summarisation

Investigating Consistency in Query-Based Meeting Summarization: A Comparative Study of Different Embedding Methods

Summarization is (Almost) Dead

TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization

Evaluating Factual Consistency of Summaries with Large Language Models