Abstract: Plain language summaries (PLSs) have been introduced to communicate research in an understandable way to a nonexpert audience. Guidelines for writing PLSs have been developed and empirical research on PLSs has been conducted, but terminology and research approaches in this comparatively young field vary considerably. This prompted us to review the current state of the art of the theoretical and empirical literature on PLSs. The two main objectives of this review were to develop a conceptual framework for PLS theory, and to synthesize empirical evidence on PLS criteria. We began by searching Web of Science, PubMed, PsycInfo and PSYNDEX (last search 07/2021). In our review, we included empirical investigations of PLSs, reports on PLS development, PLS guidelines, and theoretical articles referring to PLSs. A conceptual framework was developed through content analysis. Empirical studies investigating effects of PLS criteria on defined outcomes were narratively synthesized. We identified 7,714 records, of which 90 articles met the inclusion criteria. All articles were used to develop a conceptual framework for PLSs which comprises 12 categories: six of PLS aims and six of PLS characteristics. Thirty-three articles empirically investigated effects of PLSs on several outcomes, but study designs were too heterogeneous to identify definite criteria for high-quality PLSs. Few studies identified effects of various criteria on accessibility, understanding, knowledge, communication of research, and empowerment. We did not find empirical evidence to support most of the criteria we identified in the PLS writing guidelines. We conclude that although considerable work on establishing and investigating PLSs is available, empirical evidence on criteria for high-quality PLSs remains scarce. The conceptual framework developed in this review may provide a valuable starting point for future guideline developers and PLS researchers.

APPLS: Evaluating Evaluation Metrics for Plain Language Summarization

Toward Human-Like Evaluation for Natural Language Generation with Error Analysis

A Comparative Study of Quality Evaluation Methods for Text Summarization

Unveiling LLM Evaluation Focused on Metrics: Challenges and Solutions

Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory

Towards Dataset-scale and Feature-oriented Evaluation of Text Summarization in Large Language Model Prompts

PPLqa: An Unsupervised Information-Theoretic Quality Metric for Comparing Generative Large Language Models

Optimizing the role of human evaluation in LLM-based spoken document summarization systems

Automated Evaluation of Personalized Text Generation using Large Language Models

Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks

Plain language summaries: A systematic review of theory, guidelines and empirical research

How to Evaluate a Summarizer: Study Design and Statistical Analysis for Manual Linguistic Quality Evaluation

RepEval: Effective Text Evaluation with LLM Representation

SummEval: Re-evaluating Summarization Evaluation

An In-depth Evaluation of GPT-4 in Sentence Simplification with Error-based Human Assessment

Can Large Language Models Serve as Evaluators for Code Summarization?

NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist

LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores

FineSurE: Fine-grained Summarization Evaluation using LLMs

DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering

From task to evaluation: an automatic text summarization review