Abstract: Despite tremendous improvements in natural language generation, summarization models still suffer from the unfaithfulness issue. Previous work evaluates faithfulness either using models trained on other tasks or in-domain synthetic data, or by prompting a large model such as ChatGPT. This paper proposes to do zero-shot faithfulness evaluation simply with a moderately-sized foundation language model. We introduce a new metric, FFLM, which is a combination of probability changes based on the intuition that prefixing a piece of text that is consistent with the output will increase the probability of predicting the output. Experiments show that FFLM performs competitively with or even outperforms ChatGPT on both inconsistency detection and faithfulness rating with 24x fewer parameters. FFLM also achieves improvements over other strong baselines.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the unfaithfulness problem in text summarization. Although natural language generation has made remarkable progress, summarization models still generate unfaithful summaries: summaries that are inconsistent with the source document or contain misinformation. An effective faithfulness evaluation metric is therefore crucial both for deploying summarization systems in practice and for developing more faithful summarization models.

### Main contributions

1. **Propose a zero-shot faithfulness evaluation method based on a foundation language model**:
   - The authors propose a new metric, FFLM, for zero-shot faithfulness evaluation with a moderately-sized foundation language model. FFLM measures the faithfulness of a summary by combining several kinds of probability changes.
2. **Introduce the comprehensive evaluation metric FFLM**:
   - FFLM scores faithfulness by computing how the generation probability of a text changes under different conditions, combining changes in prior probability and in conditional probability. The authors verify the rationality of its design.
3. **Experimental results**:
   - FFLM built on the 7-billion-parameter LLaMA model performs as well as or better than ChatGPT, with 175 billion parameters, on multiple datasets. FFLM performs well on both inconsistency detection and faithfulness rating.

### Method overview

#### 2.1 Probability changes in faithfulness measurement

- **Prior probability change**:
  - Calculate the prior probability \(p_{\text{lm}}(Y)\) of the summary \(Y\) and its sequence-to-sequence probability \(p_{\text{s2s}}(Y|X)\) given the document \(X\).
  - If \(Y\) is a faithful summary, then \(p_{\text{s2s}}(Y|X)\) should be greater than \(p_{\text{lm}}(Y)\), because \(X\) provides more information consistent with \(Y\).
  - The faithfulness measurement formula, averaged over the \(m\) tokens \(y_1, \dots, y_m\) of \(Y\), is:
    \[ \Delta_{\text{prior}}^p(Y)=\frac{1}{m}\sum_{i = 1}^m\left(p_{\text{s2s}}(y_i)-p_{\text{lm}}(y_i)\right) \]
- **Conditional probability change**:
  - Add an additional piece of text \(P\) to the input and measure how the generation probability of \(Y\) changes. If \(Y\) is inconsistent with \(X\), adding \(P\) provides additional evidence, so the generation probability increases.
  - The faithfulness measurement formula, where \(p_{\text{pref}}(y_i)\) is the probability of \(y_i\) once \(P\) has been added, is:
    \[ \Delta_{\text{cond}}^p(Y)=\frac{1}{m}\sum_{i = 1}^m\left(p_{\text{s2s}}(y_i)-p_{\text{pref}}(y_i)\right) \]

#### 2.2 Feasible design of FFLM

- **Emphasis on low-probability words**:
  - By taking the logarithm of the probability and re-weighting the probability change of each word, FFLM pays more attention to low-probability words.
  - The final FFLM formula is:
    \[ \text{FFLM}=\alpha\Delta_{\text{prior}}^p(Y)+\beta\Delta_{\text{prior}}^p(X)+\delta\Delta_{\text{cond}}^p(Y) \]
  - where \(\alpha, \beta, \delta\) are weighting parameters, each between 0 and 1, with \(\alpha+\beta+\delta = 1\).

### Experimental setup

#### 3.1 Inconsistency detection

- **Dataset**:
  - Use
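
As an illustration of the probability changes in Section 2.1 and the weighted combination in Section 2.2 above, the following is a minimal Python sketch, not the authors' implementation. It assumes the per-token probabilities have already been obtained from a language model; the function names `mean_probability_change` and `fflm_score`, the example probabilities, and the example weights are all hypothetical. It works on raw probabilities, exactly as the formulas are written in this summary, and does not include the logarithm and re-weighting step mentioned in Section 2.2.

```python
from typing import Sequence


def mean_probability_change(p_a: Sequence[float], p_b: Sequence[float]) -> float:
    """Average per-token difference p_a(y_i) - p_b(y_i) over the m tokens.

    With p_a = p_s2s and p_b = p_lm this is the prior probability change;
    with p_a = p_s2s and p_b = p_pref it is the conditional probability change.
    """
    if len(p_a) != len(p_b) or len(p_a) == 0:
        raise ValueError("need two non-empty probability lists of equal length")
    return sum(a - b for a, b in zip(p_a, p_b)) / len(p_a)


def fflm_score(
    delta_prior_y: float,
    delta_prior_x: float,
    delta_cond_y: float,
    alpha: float,
    beta: float,
    delta: float,
) -> float:
    """Weighted combination of the three probability changes."""
    weights = (alpha, beta, delta)
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("each weight must lie in [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return alpha * delta_prior_y + beta * delta_prior_x + delta * delta_cond_y


if __name__ == "__main__":
    # Hypothetical per-token probabilities for a two-token summary Y.
    p_s2s_y = [0.9, 0.8]    # p(y_i | X): summary tokens given the document
    p_lm_y = [0.5, 0.4]     # p(y_i): summary tokens without the document
    p_pref_y = [0.95, 0.9]  # p(y_i | P, X): summary tokens with P added

    # Hypothetical per-token probabilities for a two-token document X.
    p_s2s_x = [0.6, 0.7]    # document tokens given the summary
    p_lm_x = [0.4, 0.5]     # document tokens without the summary

    d_prior_y = mean_probability_change(p_s2s_y, p_lm_y)
    d_prior_x = mean_probability_change(p_s2s_x, p_lm_x)
    d_cond_y = mean_probability_change(p_s2s_y, p_pref_y)

    # Illustrative weights only; the summary does not state the values used.
    print(fflm_score(d_prior_y, d_prior_x, d_cond_y, alpha=0.25, beta=0.25, delta=0.5))
```

Keeping the averaging step separate from the weighting step mirrors the structure of the formulas: the same averaging function serves all three terms, and only the pair of probability lists changes.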

Zero-shot Faithfulness Evaluation for Text Summarization with Foundation Language Model
