Zero-shot Faithfulness Evaluation for Text Summarization with Foundation Language Model

Qi Jia, Siyu Ren, Yizhu Liu, Kenny Q. Zhu
2023-12-14
Abstract: Despite tremendous improvements in natural language generation, summarization models still suffer from the unfaithfulness issue. Previous work evaluates faithfulness either using models trained on other tasks or in-domain synthetic data, or prompting a large model such as ChatGPT. This paper proposes to do zero-shot faithfulness evaluation simply with a moderately-sized foundation language model. We introduce a new metric FFLM, which is a combination of probability changes based on the intuition that prefixing a piece of text that is consistent with the output will increase the probability of predicting the output. Experiments show that FFLM performs competitively with or even outperforms ChatGPT on both inconsistency detection and faithfulness rating with 24x fewer parameters. FFLM also achieves improvements over other strong baselines.
Computation and Language
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the unfaithfulness problem in text summarization. Although natural language generation has made remarkable progress, summarization models still produce unfaithful summaries: summaries that are inconsistent with the source document or that contain misinformation. An effective faithfulness evaluation metric is therefore crucial both for deploying summarization systems in practice and for developing more faithful summarization models.

### Main contributions

1. **Propose a zero-shot faithfulness evaluation method based on a foundation language model**:
   - The authors propose a new metric named FFLM for zero-shot faithfulness evaluation based on a moderately sized foundation language model. FFLM measures the faithfulness of a summary by calculating probability changes in different ways.
2. **Introduce the comprehensive evaluation metric FFLM**:
   - FFLM scores faithfulness by measuring how the generation probability of a given text changes under different conditions, combining changes in prior probability with changes in conditional probability. The authors verify the rationality of this design.
3. **Experimental results**:
   - FFLM built on the 7-billion-parameter LLaMA model performs as well as or even better than ChatGPT, with 175 billion parameters, on multiple datasets. FFLM performs well on both inconsistency detection and faithfulness rating.

### Method overview

#### 2.1 Probability changes in faithfulness measurement

- **Prior probability change**:
  - Calculate the prior probability \(p_{\text{lm}}(Y)\) of the summary \(Y\) and the sequence-to-sequence probability \(p_{\text{s2s}}(Y|X)\) of \(Y\) given the document \(X\).
  - If \(Y\) is a faithful summary, then \(p_{\text{s2s}}(Y|X)\) should be greater than \(p_{\text{lm}}(Y)\), because \(X\) provides more information consistent with \(Y\).
  - The faithfulness measurement formula, averaged over the \(m\) tokens \(y_i\) of \(Y\), is:
    \[
    \Delta_{\text{prior}}^p(Y)=\frac{1}{m}\sum_{i = 1}^m\left(p_{\text{s2s}}(y_i)-p_{\text{lm}}(y_i)\right)
    \]
- **Conditional probability change**:
  - Prefix a piece of additional information \(P\) to the input and observe how the generation probability of \(Y\) changes; \(p_{\text{pref}}\) denotes the probability with \(P\) added. If \(Y\) is inconsistent with \(X\), adding \(P\) provides additional evidence, so the generation probability increases.
  - The faithfulness measurement formula is:
    \[
    \Delta_{\text{cond}}^p(Y)=\frac{1}{m}\sum_{i = 1}^m\left(p_{\text{s2s}}(y_i)-p_{\text{pref}}(y_i)\right)
    \]

#### 2.2 Feasible design of FFLM

- **Emphasis on low-probability words**:
  - By taking the logarithm of the probability and re-weighting the probability change of each word, FFLM pays more attention to low-probability words.
  - The final FFLM formula is:
    \[
    \text{FFLM}=\alpha\Delta_{\text{prior}}^p(Y)+\beta\Delta_{\text{prior}}^p(X)+\delta\Delta_{\text{cond}}^p(Y)
    \]
  - Here \(\alpha, \beta, \delta\) are weighting parameters, each ranging from 0 to 1, with \(\alpha+\beta+\delta = 1\).

### Experimental setup

#### 3.1 Inconsistency detection

- **Dataset**:
  - Use