Abstract: This paper introduces CovScore, an automatic reference-less methodology for evaluating thematic title sets extracted from a corpus of documents. While such extraction methods are widely used, evaluating their effectiveness remains an open question. Moreover, some existing practices rely heavily on slow and laborious human annotation procedures. Inspired by recently introduced LLM-based judge methods, we propose a novel methodology that decomposes quality into five main metrics along different aspects of evaluation. This framing simplifies and expedites the manual evaluation process and enables automatic and independent LLM-based evaluation. As a test case, we apply our approach to a corpus of Holocaust survivor testimonies, motivated both by its relevance to title set extraction and by the moral significance of this pursuit. We validate the methodology by experimenting with naturalistic and synthetic title set generation systems and comparing their performance under the methodology.

### What problem does this paper attempt to solve?

This paper addresses the evaluation of multi-document abstractive title sets. It proposes **CovScore**, a reference-less evaluation method for assessing the quality of thematic title sets extracted from a collection of documents.

#### Main Objectives:

1. **Simplify the manual evaluation process**: Quality is decomposed into five main metrics (including interpretability, coverage, non-overlap, and inner order), which makes manual evaluation simpler and faster.
2. **Automated independent evaluation**: The method supports manual evaluation as well as automated evaluation based on large language models (LLMs).
3. **Improve evaluation efficiency**: It reduces the reliance of existing evaluation practices on slow and laborious human annotation.

#### Application Case:

- The paper applies the method to a corpus of Holocaust survivor testimonies as a test case.

#### Methodological Contributions:

- A framework that decomposes the quality of title sets into several quantifiable aspects, including Interpretability, Coverage, Non-Overlap, and Inner-Order.
- Validation of the method through comparative experiments between human annotations and automated evaluation models (such as GPT-4, LLAMA-3, etc.).

#### Validation Study:

- 13 different title generation systems were designed and implemented, and their performance was compared across the evaluation metrics.
- Experimental results show that CovScore captures the trade-offs between different systems, supporting its use as a system-level comparative measure.
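The comparison between human annotations and LLM judges at the system level can be illustrated with a small sketch. This is not the paper's implementation: the function names, the aspect and system labels, and every score below are hypothetical placeholders, and Spearman rank correlation is an assumed choice of agreement measure that the summary does not confirm. The sketch only shows the general idea of checking, aspect by aspect, whether an automatic judge ranks systems the same way human annotators do.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def system_level_agreement(human, judge):
    """Map each aspect to the rank correlation between human and judge scores.

    Both arguments have the shape {aspect: {system_name: score}}.
    """
    agreement = {}
    for aspect, human_scores in human.items():
        systems = sorted(human_scores)
        agreement[aspect] = spearman(
            [human_scores[s] for s in systems],
            [judge[aspect][s] for s in systems],
        )
    return agreement


# Placeholder scores for illustration only; these are not results from the paper.
human = {
    "coverage": {"sys_a": 0.9, "sys_b": 0.5, "sys_c": 0.7},
    "non_overlap": {"sys_a": 0.2, "sys_b": 0.8, "sys_c": 0.6},
}
judge = {
    "coverage": {"sys_a": 0.8, "sys_b": 0.4, "sys_c": 0.6},
    "non_overlap": {"sys_a": 0.3, "sys_b": 0.5, "sys_c": 0.9},
}

agreement = system_level_agreement(human, judge)
for aspect, rho in agreement.items():
    print(f"{aspect}: {rho:.2f}")
```

Keeping the aspects separate, rather than averaging them into a single number, matches the summary's point that the metrics expose trade-offs between systems: a judge can agree with humans on one aspect and diverge on another.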

CovScore: Evaluation of Multi-Document Abstractive Title Set Generation
