PGA-SciRE: Harnessing LLM on Data Augmentation for Enhancing Scientific Relation Extraction

Yang Zhou, Shimin Shan, Hongkui Wei, Zhehuan Zhao, Wenshuo Feng
2024-05-30
Abstract: Relation Extraction (RE) aims to recognize the relation between pairs of entities mentioned in a text. Advances in large language models (LLMs) have had a tremendous impact on NLP. In this work, we propose a textual data augmentation framework called PGA for improving the performance of RE models in the scientific domain. The framework introduces two forms of data augmentation: paraphrasing the original training samples with an LLM to obtain pseudo-samples that keep the sentence meaning but differ in representation and form, and instructing the LLM to generate sentences that implicitly contain the corresponding label information, based on the relation and entities of the original training samples. Each of the two kinds of pseudo-samples is combined with the original dataset to train the RE model. In the experiments, the PGA framework improves the F1 scores of three mainstream RE models in the scientific domain. Using an LLM to obtain samples can also effectively reduce the cost of manually labeling data.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses data augmentation for Relation Extraction (RE) in the scientific domain. Specifically, it proposes PGA (Paraphrasing and Generating Augmentation), a data augmentation framework based on large language models (LLMs), to improve the performance of RE models in the scientific domain.

### Background and Challenges

1. **High Cost of Data Annotation**: Obtaining large-scale, high-quality annotated data is a persistent challenge across tasks, especially for scientific knowledge graphs. Scientific datasets often contain many specialized terms and abbreviations of specific concepts, which makes it difficult for models to learn and generalize.
2. **Limitations of Existing Methods**: Existing LLM-based RE methods mainly target low-resource settings and rely on techniques such as In-Context Learning (ICL) and Chain of Thought (CoT). These methods may perform poorly in the scientific domain because new concepts and terms emerge every year, and the data used to train LLMs may not include them.
3. **Need for Data Augmentation**: Improving RE models requires more training data. Data augmentation methods such as paraphrasing and generation can increase the quantity and diversity of training data, thereby improving the model's ability to generalize.

### Solution

1. **PGA Framework**: The paper proposes the PGA framework, which uses an LLM to produce two types of pseudo-samples:
   - **Paraphrasing**: The LLM paraphrases samples from the original training set, producing pseudo-samples with the same meaning but different wording.
   - **Generating**: The LLM is instructed to generate sentences that implicitly carry the corresponding label information, based on the relation and entities of the original training samples.
2. **Utilization of Pseudo-Samples**: The generated pseudo-samples are used together with the original dataset to train the RE model.
3. **Experimental Validation**: In the experiments, the PGA framework improves the F1 scores of three mainstream RE models in the scientific domain, and using LLM-generated samples can reduce the cost of manual annotation.

### Main Contributions

1. **Generating High-Quality Pseudo-Samples**: The study shows that, with simple yet effective prompts, an LLM can generate high-quality labeled pseudo-samples without manual annotation.
2. **Improving Model Performance**: The pseudo-samples produced by the PGA framework improve the F1 scores of multiple mainstream RE models.
3. **Reducing Annotation Costs**: Using an LLM to generate samples can reduce the cost of manual annotation, which is particularly important in the scientific domain.

### Conclusion

With the PGA framework, the paper tackles data augmentation for relation extraction in the scientific domain, reporting improved model performance and lower data annotation cost. The method offers a new option for natural language processing tasks in the scientific domain.
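
As a rough illustration of the two augmentation routes described under Solution, the sketch below builds one paraphrasing prompt and one generating prompt per training sample and collects the LLM outputs as labeled pseudo-samples. Everything named here is an assumption made for illustration: the `RESample` type, the prompt wording, the `augment` helper, the filter that keeps a pseudo-sample only when both entity mentions survive, and the example label `USED-FOR`. The paper's actual prompts and filtering are not given in this summary, and `llm` stands in for any text-in, text-out model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class RESample:
    """One relation-extraction training sample (hypothetical schema)."""
    sentence: str
    head: str
    tail: str
    relation: str


def paraphrase_prompt(s: RESample) -> str:
    # Assumed wording: ask for a same-meaning rewrite that keeps both entities.
    return (
        "Paraphrase the sentence below. Keep its meaning and keep the entity "
        f"mentions '{s.head}' and '{s.tail}' unchanged.\n"
        f"Sentence: {s.sentence}"
    )


def generate_prompt(s: RESample) -> str:
    # Assumed wording: ask for a new sentence that expresses the given relation.
    return (
        f"Write one scientific sentence in which '{s.head}' and '{s.tail}' "
        f"stand in the relation '{s.relation}'. Mention both entities verbatim."
    )


def augment(
    samples: List[RESample], llm: Callable[[str], str]
) -> Tuple[List[RESample], List[RESample]]:
    """Return (paraphrased, generated) pseudo-samples, labels copied from the source."""
    paraphrased: List[RESample] = []
    generated: List[RESample] = []
    for s in samples:
        for build, bucket in ((paraphrase_prompt, paraphrased),
                              (generate_prompt, generated)):
            text = llm(build(s)).strip()
            # Assumed sanity filter: drop outputs that lost an entity mention,
            # since the copied label would no longer be trustworthy.
            if s.head in text and s.tail in text:
                bucket.append(RESample(text, s.head, s.tail, s.relation))
    return paraphrased, generated


if __name__ == "__main__":
    # Stand-in for a real LLM call, used only to make the sketch runnable.
    def fake_llm(prompt: str) -> str:
        return "A CRF is applied to sequence labeling."

    original = [RESample("We use a CRF for sequence labeling.",
                         "CRF", "sequence labeling", "USED-FOR")]
    para, gen = augment(original, fake_llm)
    # Each kind of pseudo-sample is merged with the original data separately.
    train_with_paraphrases = original + para
    train_with_generated = original + gen
    print(len(train_with_paraphrases), len(train_with_generated))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client. The two training sets are kept separate here, following the abstract's statement that the two kinds of pseudo-samples are each used together with the original dataset.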