Abstract: \textbf{Objective:} We aimed to develop an advanced multi-task large language model (LLM) framework to extract multiple types of information about dietary supplements (DS) from clinical records.
\textbf{Methods:} We used four core DS information extraction tasks, namely named entity recognition (NER: 2,949 clinical sentences), relation extraction (RE: 4,892 sentences), triple extraction (TE: 2,949 sentences), and usage classification (UC: 2,460 sentences), as our multi-task setting. We introduced a novel Retrieval-Augmented Multi-task Information Extraction (RAMIE) framework that 1) employs instruction fine-tuning with task-specific prompts, 2) trains LLMs on multiple tasks jointly, improving storage efficiency and lowering training costs, and 3) incorporates retrieval-augmented generation (RAG) by retrieving similar examples from the training set. We compared RAMIE's performance to LLMs with instruction fine-tuning alone and conducted an ablation study to assess the contributions of multi-task learning (MTL) and RAG to multi-task performance.
\textbf{Results:} With the aid of the RAMIE framework, Llama2-13B achieved an F1 score of 87.39 (3.51\% improvement) on the NER task and demonstrated outstanding performance on the RE task with an F1 score of 93.74 (1.15\% improvement). For the TE task, Llama2-7B scored 79.45 (14.26\% improvement), and MedAlpaca-7B achieved the highest F1 score of 93.45 (0.94\% improvement) on the UC task. The ablation study revealed that while MTL increased efficiency with a slight trade-off in performance, RAG significantly boosted overall accuracy.
\textbf{Conclusion:} This study presents a novel RAMIE framework that demonstrates substantial improvements in multi-task information extraction for DS-related data from clinical records. Our framework can potentially be applied to other domains.
Computation and Language; Artificial Intelligence; Computational Engineering, Finance, and Science
What problem does this paper attempt to address?
This paper addresses the extraction of multiple types of information about dietary supplements (DS) from clinical records. Specifically, it aims to develop an advanced multi-task large language model (LLM) framework that extracts this information efficiently. The framework covers four extraction tasks:
1. **Named Entity Recognition (NER)**: Identify and classify dietary-supplement entities and adverse events (AEs) in the text. For example, in the sentence "The patient reported taking cranberry juice for a urinary tract infection", the model needs to mark "cranberry juice" as a dietary supplement and "urinary tract infection" as a medical condition (in this sentence it is the reason for use rather than an adverse effect of the supplement).
2. **Relation Extraction (RE)**: Determine the relationships between the identified entities. For example, in the sentence "The patient experienced nausea after taking ginseng", the model needs to identify that "nausea" is reported as an adverse event associated with "ginseng".
3. **Triple Extraction (TE)**: Structure the information into "subject-predicate-object" triples. For example, in the sentence "Cranberry is used to prevent urinary tract infections", the model needs to extract the triple (Cranberry, has_indication, urinary tract infections).
4. **Usage Classification (UC)**: Classify the usage status (start, continue, stop, or uncertain) of dietary supplements described in clinical records. For example, in the sentence "The patient stopped taking fish oil due to side effects", the model needs to classify the usage status as "stop".
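The four tasks above can be cast as instruction-style records and mixed into a single training set, which is the general idea behind multi-task instruction fine-tuning. The following is a minimal sketch under stated assumptions: the instruction wording, the prompt layout, the NER and RE answer formats, and the helper names `build_prompt` and `build_mixed_training_set` are all hypothetical and are not taken from the paper.

```python
import random

# Hypothetical instruction wording; the paper's actual prompts are not shown in this summary.
TASK_INSTRUCTIONS = {
    "NER": "List the dietary supplement and event entities in the sentence.",
    "RE": "State the relation between the supplement and the event in the sentence.",
    "TE": "Extract (subject, predicate, object) triples from the sentence.",
    "UC": "Classify the supplement usage status as start, continue, stop, or uncertain.",
}


def build_prompt(task, sentence):
    """Wrap a clinical sentence in a task-specific instruction (assumed layout)."""
    return (
        f"### Instruction:\n{TASK_INSTRUCTIONS[task]}\n"
        f"### Input:\n{sentence}\n"
        f"### Response:\n"
    )


# Example sentences come from the task list above; the NER and RE answer
# formats are illustrative assumptions, not the paper's label scheme.
labeled = [
    ("NER", "The patient reported taking cranberry juice for a urinary tract infection.",
     "cranberry juice [supplement]; urinary tract infection [event]"),
    ("RE", "The patient experienced nausea after taking ginseng.",
     "ginseng -> nausea: adverse_event"),
    ("TE", "Cranberry is used to prevent urinary tract infections.",
     "(Cranberry, has_indication, urinary tract infections)"),
    ("UC", "The patient stopped taking fish oil due to side effects.",
     "stop"),
]


def build_mixed_training_set(examples, seed=0):
    """Turn (task, sentence, answer) tuples into shuffled prompt/response records."""
    records = [
        {"task": task, "prompt": build_prompt(task, sentence), "response": answer}
        for task, sentence, answer in examples
    ]
    random.Random(seed).shuffle(records)  # interleave tasks so one model sees all four
    return records


training_set = build_mixed_training_set(labeled)
```

Training one model on the interleaved records, rather than one model per task, is what allows a single set of weights to be stored and served for all four tasks.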
### Background and Challenges
Dietary supplements play an important role in promoting health and wellness, but there are many problems with their quality and safety. Since dietary supplements are classified as food rather than drugs, they are not strictly regulated by the FDA, which leads to insufficient ingredient transparency, lack of rigorous clinical trials and mechanism research, and thus may cause adverse events. Clinical records contain a large amount of information about dietary supplements and their adverse events, which is of great value for public health, medical research and regulation. However, this information is usually embedded in the unstructured text of electronic health records, and advanced information-extraction methods are required to comprehensively and accurately identify relevant entities, events and their relationships.
### Limitations of Existing Research
Although some studies have used natural language processing (NLP) techniques to analyze dietary supplements in text, these methods still have limitations when dealing with complex clinical texts and multiple entity types and relationships. For example, Bi-LSTM and BERT models perform poorly on unseen or complex clinical texts. Recently, large language models (LLMs) such as the GPT and Llama series have made significant progress and have proven effective in health-record and information-extraction tasks. However, their application to dietary-supplement information extraction is still at an exploratory stage.
### Main Contributions of the Paper
1. **First Exploration**: This is the first exploration of the potential of LLMs in multi-task information extraction of dietary supplements, covering NER, RE, TE and UC tasks.
2. **Proposing the RAMIE Framework**: A Retrieval-Augmented Multi-task Information Extraction (RAMIE) framework is proposed, which improves extraction accuracy, model efficiency and scalability through multi-task learning (MTL), retrieval-augmented generation (RAG) and instruction fine-tuning.
3. **Comprehensive Experiments**: On 8 state-of-the-art...
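
The retrieval step in the RAMIE framework (contribution 2) retrieves similar examples from the training set and supplies them alongside the input. The sketch below illustrates that idea only: it uses a bag-of-words cosine similarity as a stand-in retriever, since this summary does not say which retriever the paper uses, and the names `retrieve_similar` and `augment_prompt`, the prompt layout, and the toy training set are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_similar(query, train_set, k=2):
    """Return the k training examples most similar to the query sentence."""
    q = Counter(tokenize(query))
    ranked = sorted(
        train_set,
        key=lambda ex: cosine(q, Counter(tokenize(ex["sentence"]))),
        reverse=True,
    )
    return ranked[:k]


def augment_prompt(instruction, query, retrieved):
    """Prepend retrieved examples as demonstrations (assumed layout)."""
    demos = "\n".join(
        f"Sentence: {ex['sentence']}\nAnswer: {ex['answer']}" for ex in retrieved
    )
    return f"{instruction}\n\nSimilar examples:\n{demos}\n\nSentence: {query}\nAnswer:"


# Toy usage-classification training set (illustrative sentences, not real data).
train_set = [
    {"sentence": "The patient stopped taking fish oil due to side effects.", "answer": "stop"},
    {"sentence": "She continues to take vitamin D daily.", "answer": "continue"},
    {"sentence": "He will start melatonin tonight.", "answer": "start"},
]

query = "Patient stopped taking ginseng last week."
top = retrieve_similar(query, train_set, k=1)
prompt = augment_prompt(
    "Classify the supplement usage status as start, continue, stop, or uncertain.",
    query,
    top,
)
```

In practice a dense sentence-embedding retriever would likely replace the bag-of-words scorer; the surrounding logic (rank training examples, keep the top k, prepend them to the prompt) stays the same.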