Abstract: Relation Extraction (RE) aims to recognize the relation between pairs of entities mentioned in a text. Advances in large language models (LLMs) have had a tremendous impact on NLP. In this work, we propose a textual data augmentation framework called PGA for improving the performance of RE models in the scientific domain. The framework introduces two forms of data augmentation. The first uses an LLM to paraphrase the original training samples, yielding pseudo-samples that keep the sentence meaning but differ in wording and form. The second instructs the LLM to generate sentences that implicitly contain the corresponding label information, based on the relation and entities of the original training samples. Each of the two kinds of pseudo-samples is combined with the original dataset to train the RE model. In our experiments, the PGA framework improves the F1 scores of three mainstream RE models in the scientific domain. In addition, using an LLM to obtain samples can effectively reduce the cost of manually labeling data.
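
The two augmentation routes described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's code: the `Sample` type, the prompt wording, the `augment` helper, and the entity-presence filter are all hypothetical, and `llm` stands for any callable that maps a prompt string to a completion string.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Sample:
    """One RE training example: a sentence, an entity pair, and a relation label."""
    sentence: str
    head: str
    tail: str
    relation: str


def paraphrase_prompt(s: Sample) -> str:
    # Hypothetical wording; the paper's actual prompts are not reproduced here.
    return (
        "Paraphrase the sentence below. Keep its meaning and keep the entity "
        f'mentions "{s.head}" and "{s.tail}" unchanged.\n'
        f"Sentence: {s.sentence}"
    )


def generate_prompt(s: Sample) -> str:
    # Hypothetical wording: ask for a new sentence that implies the label.
    return (
        f'Write one scientific sentence that mentions "{s.head}" and "{s.tail}" '
        f'and implies the relation "{s.relation}" between them '
        "without naming the relation explicitly."
    )


def augment(
    samples: List[Sample],
    llm: Callable[[str], str],
    build_prompt: Callable[[Sample], str],
) -> List[Sample]:
    """Create pseudo-samples that reuse the entities and label of each original."""
    pseudo = []
    for s in samples:
        text = llm(build_prompt(s)).strip()
        # Assumed sanity filter: keep the output only if both entity mentions
        # survive, so that the original label can be carried over.
        if s.head in text and s.tail in text:
            pseudo.append(Sample(text, s.head, s.tail, s.relation))
    return pseudo
```

Calling `augment(train, llm, paraphrase_prompt)` and `augment(train, llm, generate_prompt)` would yield two pseudo-sample sets; each would then be concatenated with the original training set before training the RE model.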

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses data augmentation for Relation Extraction (RE) in the scientific domain. Specifically, it proposes PGA (Paraphrasing and Generating Augmentation), a data augmentation framework based on large language models (LLMs), to improve the performance of RE models in the scientific domain.

### Background and Challenges

1. **High Cost of Data Annotation**: Obtaining large-scale, high-quality annotated data is a significant challenge for many tasks, especially for scientific knowledge graphs. Scientific datasets often contain many specialized terms and abbreviations of specific concepts, which makes it difficult for models to learn and generalize.
2. **Limitations of Existing Methods**: Existing RE methods mainly focus on low-resource settings and generate pseudo-samples through techniques such as In-Context Learning (ICL) and Chain of Thought (CoT). These methods may perform poorly in the scientific domain, because new concepts and terms emerge every year and the data used to train LLMs may not include them.
3. **Need for Data Augmentation**: Improving the performance of RE models requires more training data. Data augmentation methods such as paraphrasing and generating increase the quantity and diversity of training data, which improves the model's generalization ability.

### Solution

1. **PGA Framework**: The paper proposes the PGA framework, which uses an LLM to produce two types of pseudo-samples:
   - **Paraphrasing**: Paraphrasing the original training samples to obtain pseudo-samples with the same meaning but different wording.
   - **Generating**: Instructing the LLM to generate sentences that implicitly carry the corresponding label information, based on the relation and entities of the original training samples.
2. **Utilization of Pseudo-Samples**: The generated pseudo-samples are used together with the original dataset to train the RE model, thereby improving its performance.
3. **Experimental Validation**: Experimental results show that the PGA framework improves the F1 scores of three mainstream RE models in the scientific domain, and that using LLM-generated samples can effectively reduce the cost of manual annotation.

### Main Contributions

1. **Generating High-Quality Pseudo-Samples**: The study shows that, with simple yet effective prompts, an LLM can generate high-quality, labeled pseudo-samples without manual annotation.
2. **Improving Model Performance**: The pseudo-samples generated by the PGA framework improve the F1 scores of multiple mainstream RE models.
3. **Reducing Annotation Costs**: Using an LLM to generate samples can effectively reduce the cost of manual annotation, which is particularly important in the scientific domain.

### Conclusion

With the PGA framework, this paper addresses data augmentation for relation extraction in the scientific domain, improving model performance while reducing the cost of data annotation. The method offers a new option for natural language processing tasks in the scientific domain.
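
The F1 scores cited above are the harmonic mean of precision and recall. As a hedged illustration of how a baseline model and an augmented model might be compared, the sketch below computes micro-averaged F1 over predicted relation triples. The `Triple` layout and the `micro_f1` helper are assumptions made for this example; the paper's exact evaluation protocol is not described in this summary.

```python
from typing import Set, Tuple

# Assumed layout: (sentence_id, head_entity, tail_entity, relation_label)
Triple = Tuple[str, str, str, str]


def micro_f1(gold: Set[Triple], pred: Set[Triple]) -> float:
    """Micro-averaged F1 over predicted relation triples."""
    true_positives = len(gold & pred)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A prediction counts as correct here only if the sentence, both entities, and the relation label all match a gold triple; stricter or looser matching rules would change the score.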

PGA-SciRE: Harnessing LLM on Data Augmentation for Enhancing Scientific Relation Extraction
