ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews

Mike D'Arcy, Alexis Ross, Erin Bransom, Bailey Kuehl, Jonathan Bragg, Tom Hope, Doug Downey
2024-08-06
Abstract: We introduce the task of automatically revising scientific papers based on peer feedback and release ARIES, a dataset of review comments and their corresponding paper edits. The data is drawn from real reviewer-author interactions from computer science, and we provide labels linking each reviewer comment to the specific paper edits made by the author in response. We automatically create a high-precision silver training set, as well as an expert-labeled test set that shows high inter-annotator agreement. In experiments with 10 models covering the state of the art, we find that they struggle even to identify which edits correspond to a comment -- especially when the relationship between the edit and the comment is indirect and requires reasoning to uncover. We also extensively analyze GPT-4's ability to generate edits given a comment and the original paper. We find that it often succeeds on a superficial level, but tends to rigidly follow the wording of the feedback rather than the underlying intent, and lacks technical details compared to human-written edits.
Computation and Language
### What problem does this paper attempt to solve?

This paper addresses the problem of automatically revising scientific papers based on peer-review feedback. Specifically, the authors propose two main tasks:

1. **Comment-Edit Alignment**:
   - The goal is to identify which specific edits correspond to a given review comment.
   - The input is a review comment and a set of candidate edits (each consisting of the original text and the revised text). The output is a binary label for each comment-edit pair, indicating whether the edit responds to the comment.
2. **Edit Generation**:
   - The goal is to automatically generate an appropriate edit in response to a given review comment.
   - The input is a review comment and the original paper text, and the output is the generated edit. The edit should respond to the reviewer's feedback and remain coherent in the context of the paper.

### Research Background and Challenges

Existing natural language processing (NLP) systems can generate fluent and coherent text, but they still perform poorly on complex writing tasks that require interpretation and reasoning. Scientific writing is especially challenging because it demands in-depth expertise and reasoning ability. In addition, datasets for studying such tasks are scarce, and high-quality datasets for scientific writing in particular are lacking.

### ARIES Dataset

To address these problems, the authors constructed the ARIES (Aligned, Review-Informed Edits of Scientific Papers) dataset. It contains real computer science paper drafts obtained from the OpenReview platform, the corresponding review feedback, and the authors' responses and revisions. Through a combination of automatic methods and expert annotation, ARIES provides high-precision comment-edit alignment data, supporting both the training and the evaluation of models on substantive editing tasks in a technical domain.

### Experimental Results

In experiments with multiple baseline models, the authors found that even large language models such as GPT-4 struggle with the comment-edit alignment task, especially when the relationship between a comment and an edit is indirect. For the edit generation task, GPT-4 can generate edits that are superficially coherent and relevant, but it often fails to capture the underlying intent of the feedback, includes fewer technical details than human-written edits, and rarely generates edits that push back on the review comments.

### Main Contributions

1. Proposed new tasks: aligning high-level draft feedback with specific edits, and generating revisions of scientific papers based on review feedback.
2. Constructed the ARIES dataset, which contains 3.9K automatically matched review comments and edits, as well as a carefully annotated test set (196 manually annotated comments).
3. Evaluated a wide range of baseline methods and found that even modern large language models (such as GPT-4) perform poorly on the comment-edit alignment task.
4. Conducted a detailed analysis of GPT-4's performance on the edit generation task, revealed systematic differences between the generated edits and the real edits, and pointed out future research directions.

### Conclusion

This paper advances research on automated scientific paper revision by introducing the ARIES dataset and proposing new tasks. It shows the limitations of existing models and indicates directions for future improvement.
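
### Illustrative Sketch: Comment-Edit Alignment as Pairwise Classification

To make the input and output of the comment-edit alignment task concrete, the following is a minimal sketch of a token-overlap baseline. It is not the paper's method or one of its evaluated models: the `Edit` class, the `overlap_score` and `align` functions, the threshold of 0.2, and the example texts are all hypothetical choices made for illustration. The only part taken from the summary is the task shape: one comment, a set of candidate edits (original and revised text), and one binary label per comment-edit pair.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """Hypothetical container for a candidate edit: text before and after revision."""
    original: str
    revised: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(comment: str, edit: Edit) -> float:
    """Fraction of the comment's tokens that appear among tokens the edit added."""
    comment_tokens = tokenize(comment)
    if not comment_tokens:
        return 0.0
    added_tokens = tokenize(edit.revised) - tokenize(edit.original)
    return len(comment_tokens & added_tokens) / len(comment_tokens)


def align(comment: str, edits: list[Edit], threshold: float = 0.2) -> list[bool]:
    """Return one binary label per candidate edit (assumed threshold, not tuned)."""
    return [overlap_score(comment, edit) >= threshold for edit in edits]


# Made-up example data, not drawn from ARIES.
comment = "Please add a comparison with the BERT baseline."
edits = [
    Edit(
        original="We evaluate our model on two datasets.",
        revised="We evaluate our model on two datasets and add a comparison "
                "with a BERT baseline.",
    ),
    Edit(
        original="Teh results are shown in Table 2.",
        revised="The results are shown in Table 2.",
    ),
]

labels = align(comment, edits)
print(labels)  # [True, False]
```

A purely lexical score like this can only catch edits that reuse the reviewer's wording. It would be expected to miss exactly the cases the summary identifies as hardest, where the relationship between comment and edit is indirect and requires reasoning to uncover.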