Abstract: We introduce the task of automatically revising scientific papers based on peer feedback and release ARIES, a dataset of review comments and their corresponding paper edits. The data is drawn from real reviewer-author interactions from computer science, and we provide labels linking each reviewer comment to the specific paper edits made by the author in response. We automatically create a high-precision silver training set, as well as an expert-labeled test set that shows high inter-annotator agreement. In experiments with 10 models covering the state of the art, we find that they struggle even to identify which edits correspond to a comment -- especially when the relationship between the edit and the comment is indirect and requires reasoning to uncover. We also extensively analyze GPT-4's ability to generate edits given a comment and the original paper. We find that it often succeeds on a superficial level, but tends to rigidly follow the wording of the feedback rather than the underlying intent, and lacks technical details compared to human-written edits.

What problem does this paper attempt to address?

This paper addresses the problem of automatically revising scientific papers based on peer-review feedback. Specifically, the authors propose two tasks:

1. **Comment-Edit Alignment**:
   - The goal is to identify which specific edits correspond to a given review comment.
   - The input is a review comment and a set of candidate edits (each comprising the original text and the revised text). The output is a binary classification for each comment-edit pair, indicating whether the edit responds to the comment.
2. **Edit Generation**:
   - The goal is to automatically generate an appropriate edit for a given review comment.
   - The input is a review comment and the original paper text. The output is the generated edit, which should respond to the reviewer's feedback and remain coherent in the context of the paper.

### Research Background and Challenges

Existing natural language processing (NLP) systems can generate fluent and coherent text, but their performance remains limited on complex writing tasks that require interpretation and reasoning. Scientific writing is especially challenging because it demands in-depth expertise and reasoning ability. In addition, few datasets exist for studying such tasks, and high-quality datasets for scientific writing in particular are scarce.

### ARIES Dataset

To address these problems, the authors constructed the ARIES (Aligned, Review-Informed Edits of Scientific Papers) dataset. It contains real computer science paper drafts obtained from the OpenReview platform, the corresponding review feedback, and the authors' responses and revisions. Through automatic methods and expert annotation, ARIES provides high-precision comment-edit alignment data, supporting the training and evaluation of models on substantive editing tasks in a technical domain.

### Experimental Results

In experimental evaluations of multiple baseline models, the authors found that even large language models such as GPT-4 struggle with the comment-edit alignment task, especially when the relationship between a comment and an edit is indirect. For the edit generation task, GPT-4 can generate superficially coherent and relevant edits, but it often fails to capture the underlying intent of the feedback, lacks technical details, and rarely generates edits that push back on the review comments.

### Main Contributions

1. Proposed new tasks: aligning high-level draft feedback with specific edits, and generating revisions of scientific papers based on review feedback.
2. Constructed the ARIES dataset, which contains 3.9K automatically matched review comments and edits, as well as a carefully annotated test set (196 manually annotated comments).
3. Evaluated a wide range of baseline methods and found that even modern large language models (such as GPT-4) perform poorly on the comment-edit alignment task.
4. Conducted a detailed analysis of GPT-4's performance on the edit generation task, revealed systematic differences between the generated edits and the real edits, and pointed out future research directions.

### Conclusion

This paper advances research on automated scientific paper revision by introducing the ARIES dataset and proposing new tasks. It shows the limitations of existing models and points to clear directions for future improvement.
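To make the comment-edit alignment formulation concrete, the following is a minimal Python sketch of the task's input and output: one binary decision per comment-edit pair, scored with precision, recall, and F1. It is not the authors' method or any baseline from the paper. The `Edit` class, the token-overlap scorer, the `0.1` threshold, and the example texts are all hypothetical choices made for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """A candidate edit: the original text and its revised replacement (hypothetical schema)."""
    original: str
    revised: str


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric word tokens; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(comment: str, edit: Edit) -> float:
    """Jaccard overlap between the comment's tokens and the tokens the edit adds."""
    added = tokens(edit.revised) - tokens(edit.original)
    comment_tokens = tokens(comment)
    union = comment_tokens | added
    return len(comment_tokens & added) / len(union) if union else 0.0


def align(comment: str, edits: list[Edit], threshold: float = 0.1) -> list[bool]:
    """Return one binary label per comment-edit pair (threshold is an assumed value)."""
    return [overlap_score(comment, edit) >= threshold for edit in edits]


def precision_recall_f1(predicted: list[bool], gold: list[bool]) -> tuple[float, float, float]:
    """Standard pairwise precision, recall, and F1 over binary alignment labels."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented example data, not drawn from ARIES.
comment = "Please report the standard deviation across random seeds."
edits = [
    Edit("We report mean accuracy.",
         "We report mean accuracy and standard deviation across five random seeds."),
    Edit("The model has two layers.", "The model has three layers."),
]
predicted = align(comment, edits)
```

A purely lexical scorer like this can only link an edit to a comment when the two share wording, so it would be expected to miss the indirect comment-edit relationships that the summary identifies as the hard cases.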

ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews
