SimCSum: Joint Learning of Simplification and Cross-lingual Summarization for Cross-lingual Science Journalism

Mehwish Fatima, Tim Kolber, Katja Markert, Michael Strube
2023-04-04
Abstract: Cross-lingual science journalism generates popular science stories of scientific articles different from the source language for a non-expert audience. Hence, a cross-lingual popular summary must contain the salient content of the input document, and the content should be coherent, comprehensible, and in a local language for the targeted audience. We improve these aspects of cross-lingual summary generation by joint training of two high-level NLP tasks, simplification and cross-lingual summarization. The former task reduces linguistic complexity, and the latter focuses on cross-lingual abstractive summarization. We propose a novel multi-task architecture - SimCSum consisting of one shared encoder and two parallel decoders jointly learning simplification and cross-lingual summarization. We empirically investigate the performance of SimCSum by comparing it with several strong baselines over several evaluation metrics and by human evaluation. Overall, SimCSum demonstrates statistically significant improvements over the state-of-the-art on two non-synthetic cross-lingual scientific datasets. Furthermore, we conduct an in-depth investigation into the linguistic properties of generated summaries and an error analysis.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the challenges of cross-lingual science journalism: generating popular science summaries in a target language from scientific articles written in a different source language. The summaries must also reduce linguistic complexity so that non-expert readers can understand them. The paper proposes a multi-task learning architecture, SimCSum, which improves the quality of cross-lingual popular summaries by jointly training two high-level natural language processing tasks: text simplification and cross-lingual summarization. SimCSum consists of one shared encoder and two parallel decoders, one for simplification and one for cross-lingual summarization. In this way, SimCSum aims to generate summaries that are more coherent, more comprehensible, and written in the local language of the targeted non-expert audience.

### Main contributions of the paper:

1. **Introduction of SimCSum**: A multi-task learning model that improves the quality of cross-lingual popular summaries by jointly training text simplification and cross-lingual summarization. The paper also introduces a strong baseline, Simplify-Then-Summarize, for performance comparison.
2. **Empirical evaluation**: SimCSum is evaluated on two cross-lingual scientific datasets and compared with existing cross-lingual summarization models. A human evaluation additionally assesses the linguistic quality of the generated summaries.
3. **In-depth analysis**: The paper analyzes lexical, readability, and syntactic features of the generated summaries and conducts an error analysis of the outputs.
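The shared-encoder, two-decoder design described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class and function names are hypothetical, the encoder and decoders are stand-in callables, and combining the two task losses as a weighted sum with a `weight` parameter is an assumption that the source does not specify.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Distribution = Sequence[float]  # probabilities over a toy vocabulary


def cross_entropy(step_probs: Sequence[Distribution], target_ids: Sequence[int]) -> float:
    """Mean negative log-likelihood of the reference tokens."""
    if len(step_probs) != len(target_ids) or not target_ids:
        raise ValueError("need exactly one distribution per target token")
    return -sum(math.log(p[t]) for p, t in zip(step_probs, target_ids)) / len(target_ids)


@dataclass
class SharedEncoderTwoDecoders:
    """Hypothetical container: one encoder whose output feeds two decoders."""
    encoder: Callable[[Sequence[int]], object]
    simplification_decoder: Callable[[object], List[Distribution]]
    summarization_decoder: Callable[[object], List[Distribution]]

    def forward(self, source_ids: Sequence[int]) -> Tuple[List[Distribution], List[Distribution]]:
        hidden = self.encoder(source_ids)  # encoded once, read by both decoders
        return self.simplification_decoder(hidden), self.summarization_decoder(hidden)


def joint_loss(model: SharedEncoderTwoDecoders,
               source_ids: Sequence[int],
               simple_ids: Sequence[int],
               summary_ids: Sequence[int],
               weight: float = 0.5) -> float:
    """Weighted sum of the two task losses (the weighting scheme is an assumption)."""
    simple_probs, summary_probs = model.forward(source_ids)
    loss_simplify = cross_entropy(simple_probs, simple_ids)
    loss_summarize = cross_entropy(summary_probs, summary_ids)
    return weight * loss_simplify + (1.0 - weight) * loss_summarize


# Toy stand-ins so the sketch runs without a deep learning framework.
encoder_calls = []


def toy_encoder(ids: Sequence[int]) -> int:
    encoder_calls.append(tuple(ids))
    return sum(ids)


toy_model = SharedEncoderTwoDecoders(
    encoder=toy_encoder,
    simplification_decoder=lambda hidden: [[0.5, 0.5], [0.25, 0.75]],
    summarization_decoder=lambda hidden: [[0.9, 0.1]],
)

loss = joint_loss(toy_model, [3, 1, 2], simple_ids=[0, 1], summary_ids=[0], weight=0.5)
print(f"joint loss: {loss:.4f}")
```

Because both decoders read the same encoder output, gradients from either task would update the shared encoder in a real model; the toy version only shows that the source is encoded once per training example.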
### Overview of the paper structure:

- **Introduction**: Introduces the background and requirements of cross-lingual science journalism, in particular those of "Spektrum der Wissenschaft", the German version of "Scientific American".
- **Related work**: Reviews research on scientific summarization, cross-lingual summarization, and monolingual science journalism.
- **Proposed model**: Describes the architecture and training method of SimCSum in detail.
- **Experiments**: Presents the datasets, experimental settings, baseline models, and experimental results.
- **Results**: Reports the performance of SimCSum on the Wikipedia and Spektrum datasets, including statistical significance tests.
- **Analysis**: Examines the generated summaries in terms of lexical diversity, readability, and syntactic features.

### Experimental results:

- **Automatic evaluation**: SimCSum outperforms the baselines on multiple evaluation metrics (such as ROUGE, BERTScore, and Flesch-Kincaid Reading Ease).
- **Human evaluation**: SimCSum scores higher on fluency, relevance, and conciseness, indicating that its summaries better meet the needs of non-expert readers.
- **In-depth analysis**: SimCSum performs well on lexical diversity and syntactic structure, in particular by generating shorter and simpler sentences.

### Conclusion:

By jointly training text simplification and cross-lingual summarization, SimCSum significantly improves the quality of cross-lingual popular science summaries, making them more coherent, easier to understand, and better suited to non-expert readers. This result is relevant to automated science journalism.
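The readability metric named under automatic evaluation can be illustrated with the standard Flesch Reading Ease formula for English. This is a sketch only: the source does not state which variant or language-specific adaptation was applied to the German summaries, and the function name and its count-based interface are illustrative assumptions.

```python
def flesch_reading_ease(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Standard English Flesch Reading Ease computed from pre-counted totals.

    Higher scores indicate text that is easier to read.
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    avg_sentence_length = total_words / total_sentences
    avg_syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


# Same words and syllables, split into more (shorter) sentences.
long_sentences = flesch_reading_ease(total_words=100, total_sentences=5, total_syllables=150)
short_sentences = flesch_reading_ease(total_words=100, total_sentences=10, total_syllables=150)
print(long_sentences, short_sentences)
```

With the word and syllable counts held fixed, splitting the text into more sentences raises the score, which is how shorter and simpler sentences show up in this kind of metric.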