SciReviewGen: A Large-scale Dataset for Automatic Literature Review Generation

Tetsu Kasanishi, Masaru Isonuma, Junichiro Mori, Ichiro Sakata
DOI: https://doi.org/10.48550/arXiv.2305.15186
2023-05-24
Computation and Language
Abstract: Automatic literature review generation is one of the most challenging tasks in natural language processing. Although large language models have tackled literature review generation, the absence of large-scale datasets has been a stumbling block to the progress. We release SciReviewGen, consisting of over 10,000 literature reviews and 690,000 papers cited in the reviews. Based on the dataset, we evaluate recent transformer-based summarization models on the literature review generation task, including Fusion-in-Decoder extended for literature review generation. Human evaluation results show that some machine-generated summaries are comparable to human-written reviews, while revealing the challenges of automatic literature review generation such as hallucinations and a lack of detailed information. Our dataset and code are available at https://github.com/tetsu9923/SciReviewGen.
What problem does this paper attempt to address?
The paper addresses automatic literature review generation. Its main contribution is SciReviewGen, a large-scale dataset of over 10,000 computer science literature reviews together with approximately 690,000 papers cited in those reviews. Before this dataset, the lack of large-scale training data made it difficult to apply supervised neural summarization models to the task. Using SciReviewGen, the authors evaluate recent Transformer-based summarization models on literature review generation. They also propose Query-weighted Fusion-in-Decoder (QFiD), an extension of Fusion-in-Decoder that weights each input document by its relevance to the query, where the query consists of the review title and the section title. According to the reported experiments, QFiD outperforms the baseline models in both automatic evaluation (such as ROUGE scores) and human evaluation, and taking the relevance of each cited paper to the query into account helps it generate more appropriate reviews. The authors intend the dataset and evaluation results to serve as a benchmark for future research on automatic literature review generation.
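The query-weighting idea behind QFiD can be illustrated with a small sketch. This is not the authors' implementation: the scoring rule (a dot product between mean-pooled query and document vectors), the softmax normalization, and the function names `mean_vector`, `softmax`, and `query_weighted_fusion` are all assumptions made for illustration. It only shows the general pattern of scaling each cited paper's encoded tokens by a query-relevance weight before they are concatenated for a decoder.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(vectors):
    """Mean-pool a list of token vectors into a single vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_weighted_fusion(query_states, doc_states):
    """Weight each document's token vectors by its relevance to the query.

    query_states: token vectors for the query (review title + section title).
    doc_states:   one list of token vectors per cited paper.
    Returns (weights, fused), where fused is the concatenation of all
    weighted token vectors, as would be passed on to a decoder.
    Assumption: relevance = dot product of mean-pooled vectors.
    """
    query_vec = mean_vector(query_states)
    scores = [dot(query_vec, mean_vector(doc)) for doc in doc_states]
    weights = softmax(scores)
    fused = [
        [w * x for x in token]
        for w, doc in zip(weights, doc_states)
        for token in doc
    ]
    return weights, fused


if __name__ == "__main__":
    # Toy 2-dimensional "hidden states"; real encoders produce far larger ones.
    query = [[1.0, 0.0]]
    docs = [
        [[1.0, 0.0], [1.0, 0.0]],  # aligned with the query
        [[0.0, 1.0]],              # orthogonal to the query
    ]
    weights, fused = query_weighted_fusion(query, docs)
    print(weights)
    print(len(fused))
```

In this toy setup the first document, whose vectors point in the same direction as the query, receives the larger weight, so its tokens contribute more to the fused representation.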