CovScore: Evaluation of Multi-Document Abstractive Title Set Generation

Itamar Trainin, Omri Abend
2024-07-25
Abstract: This paper introduces CovScore, an automatic reference-less methodology for evaluating thematic title sets, extracted from a corpus of documents. While such extraction methods are widely used, evaluating their effectiveness remains an open question. Moreover, some existing practices heavily rely on slow and laborious human annotation procedures. Inspired by recently introduced LLM-based judge methods, we propose a novel methodology that decomposes quality into five main metrics along different aspects of evaluation. This framing simplifies and expedites the manual evaluation process and enables automatic and independent LLM-based evaluation. As a test case, we apply our approach to a corpus of Holocaust survivor testimonies, motivated both by its relevance to title set extraction and by the moral significance of this pursuit. We validate the methodology by experimenting with naturalistic and synthetic title set generation systems and compare their performance with the methodology.
Computation and Language
### What problem does this paper attempt to solve?

This paper addresses the evaluation of multi-document abstractive title sets. Specifically, it proposes **CovScore**, a reference-less method for assessing the quality of thematic title sets extracted from a collection of documents.

#### Main Objectives:

1. **Simplify the manual evaluation process**: Quality is decomposed into five main metrics, among them interpretability, coverage, non-overlap, and inner order, so that each aspect can be judged separately.
2. **Automated independent evaluation**: The method supports manual evaluation as well as automated evaluation based on large language models (LLMs).
3. **Improve evaluation efficiency**: It reduces the reliance of existing evaluation practices on slow, laborious human annotation.

#### Application Case:

- The paper applies the method to a corpus of Holocaust survivor testimonies to demonstrate its effectiveness and its value in a practical setting.

#### Methodological Contributions:

- A framework that decomposes the quality of title sets into several quantifiable aspects, including Interpretability, Coverage, Non-Overlap, and Inner-Order.
- Validation of the method's effectiveness and reliability through comparative experiments between human annotations and automated evaluation models (such as GPT-4 and LLAMA-3).

#### Validation Study:

- 13 different title generation systems were designed and implemented, and their performance was compared using the evaluation metrics.
- Experimental results show that CovScore captures the complex trade-offs between different systems, supporting its use as a system-level comparative measure.
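
The idea of scoring each aspect separately and comparing systems per aspect can be sketched as follows. This is a minimal illustration, not the paper's implementation: the summary does not say how each score is computed or whether the scores are combined, so the `AspectScores` container, the `rank_by_aspect` helper, the assumed 0 to 1 score range, the system names, and all numeric values are hypothetical placeholders. Only the four aspects named in the summary are included.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AspectScores:
    """Hypothetical per-aspect scores for one title set (assumed range: 0 to 1)."""
    interpretability: float
    coverage: float
    non_overlap: float
    inner_order: float


def rank_by_aspect(systems: dict[str, AspectScores]) -> dict[str, list[str]]:
    """For each aspect, return system names sorted from highest to lowest score."""
    aspects = [f.name for f in fields(AspectScores)]
    return {
        aspect: sorted(
            systems,
            key=lambda name: getattr(systems[name], aspect),
            reverse=True,
        )
        for aspect in aspects
    }


# Placeholder values for illustration only; these are NOT results from the paper.
systems = {
    "system_a": AspectScores(0.9, 0.4, 0.8, 0.7),
    "system_b": AspectScores(0.6, 0.8, 0.5, 0.9),
}

rankings = rank_by_aspect(systems)
for aspect, order in rankings.items():
    print(f"{aspect}: {' > '.join(order)}")
```

Keeping the aspects separate, rather than collapsing them into a single number, makes trade-offs visible: in the placeholder data, one system leads on interpretability and non-overlap while the other leads on coverage and inner order.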