Data Augmentation on Problem and Method Sentence Classification Task in Scientific Paper: A Mechanism Analysis Study.

Yingyi Zhang, Chengzhi Zhang
DOI: https://doi.org/10.1007/978-3-031-57867-0_2
2024-01-01
Abstract: Billions of scientific papers create a need to identify the essential parts of this massive body of text. Scientific research proceeds from posing problems to applying methods. To capture the main idea of scientific papers, we focus on extracting problem and method sentences. Annotating sentences in scientific papers is labor-intensive, so the resulting datasets are small, which limits model learning. To tackle this challenge, data augmentation has been adopted because it can generate synthetic data with minor variations, thereby expanding the original training dataset. Various data augmentation methods now exist, such as those based on random word replacement or back translation. Nevertheless, their suitability for sentence classification tasks in scientific papers remains unexplored. Thus, this paper constructs two manually annotated datasets and evaluates the performance of these data augmentation methods on them. Furthermore, this paper examines the mechanisms underlying their effects. Previous studies have suggested that data augmentation can reduce a model's reliance on high-frequency patterns. Therefore, this paper uses attention values to represent the model's dependence on words and analyzes how data augmentation methods alter the attention values of individual words within sentences. The experimental results indicate that data augmentation methods can improve the macro F1 score in sentence classification tasks. Furthermore, data augmentation methods effectively reduce the attention values assigned to stop words, commonly used words in scientific papers, and commonly used words in method and problem sentences.
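To make two terms from the abstract concrete, the following is a minimal Python sketch of random word replacement and of the macro F1 score. It is an illustration under stated assumptions, not the authors' implementation: the `SYNONYMS` table, the function names, and the replacement probability `p` are all hypothetical, and the paper may use a different replacement source and different settings.

```python
import random

# Hypothetical toy synonym table (an assumption for illustration only);
# a real system would draw replacements from a lexical resource or model.
SYNONYMS = {
    "propose": ["present", "introduce"],
    "method": ["approach", "technique"],
    "problem": ["issue", "challenge"],
}


def random_word_replacement(sentence, p=0.3, rng=None):
    """Replace each word that has a known synonym with probability p."""
    rng = rng or random.Random()
    out = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and q == label for t, q in pairs)
        fp = sum(t != label and q == label for t, q in pairs)
        fn = sum(t == label and q != label for t, q in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Each augmented sentence keeps the label of its source sentence, so calling `random_word_replacement` several times per training sentence enlarges the training set without new annotation. Macro F1 weights every class equally, which matters when problem and method sentences are much rarer than other sentences.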