What problem does this paper attempt to address?

This paper addresses safe biomedical natural language inference (NLI) over clinical trial reports (CTRs). The task is to classify statements about CTRs: given a CTR and a statement, a model must decide whether the CTR entails or contradicts the statement. The paper studies how to use large language models (LLMs) for this judgment while maintaining the accuracy, faithfulness, and consistency of their inferences in the complex reasoning that the medical domain requires.

### Background of the Paper

Large language models are now widely applied to natural language processing (NLP) tasks and have achieved strong results in textual entailment. However, they face challenges with domain-specific data such as medical text, and they are vulnerable to shortcut learning, factual inconsistency, and performance degradation. Task 2 of SemEval-2024, Safe Biomedical Natural Language Inference for Clinical Trials (NLI4CT), therefore evaluates LLMs on this medical task, with particular attention to the accuracy, faithfulness, and consistency of their inferences.

### Research Methods

1. **Model Selection**: The researchers selected the open-source Mistral-7B model, quantized it, and applied low-rank adaptation (LoRA) fine-tuning to improve its performance on the NLI4CT task.
2. **Data Augmentation**: To increase the diversity of the training data, the researchers created several training sets through manual annotation and automatic generation:
   - **Train_Manual**: Based on the original training set, with new samples produced through negation and rewriting.
   - **Train_Manual-Synthetic**: Built on Train_Manual, with additional samples generated automatically.
   - **Train_Full-Synthetic**: A large set of samples generated entirely by automatic methods.
3. **Instruction Fine-Tuning**: The researchers instruction-tuned Mistral-7B for the NLI4CT task, using a supervised fine-tuning objective with autoregressive language modeling.
4. **Experimental Setup**: The researchers evaluated different prompt templates on the development set and selected the best-performing one. The best model and training set were then evaluated on the test set.

### Experimental Results

- **Macro-F1 Score**: The best model reached a macro-F1 score of 0.80 on the test set, ranking first on the leaderboard.
- **Consistency**: The model's consistency score was 0.72, ranking 15th.
- **Faithfulness**: The model's faithfulness score was 0.83, ranking 11th.

### Conclusions

Although the model performs well in terms of classification accuracy, it performs poorly when statements are perturbed: for example, it predicts the same label for contradictory examples and different labels for synonymous rewrites. The researchers believe that adding high-quality manually annotated samples, especially adversarial rewrites, could further improve performance. Future work includes exploring different models, optimizing prompt templates, adding more training data, and carefully constructing new training sets, focusing on interventions applied to statements rather than on the number of base statements.

### Formula Representation

The paper does not involve complex formulas. For reference, the F1 score that underlies the macro-F1 metric is:

\[ \text{F1} = 2\times\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}} \]

where:

- **Precision**: \( \text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}} \)
- **Recall**: \( \text{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}} \)
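To make the link between the per-class F1 formula and the reported macro-F1 score concrete, here is a minimal sketch in plain Python. It computes precision, recall, and F1 for each class and then averages the class scores with equal weight. The function name `macro_f1`, the label strings `"Entailment"` and `"Contradiction"`, and the toy predictions are assumptions chosen for illustration; they are not taken from the paper or from the official task scorer.

```python
def macro_f1(gold, pred, labels=("Entailment", "Contradiction")):
    """Unweighted mean of per-class F1 scores (label names are illustrative)."""
    per_class = []
    for label in labels:
        # Count true positives, false positives and false negatives for this class.
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)

        # Guard against empty denominators by falling back to 0.0.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        per_class.append(f1)

    # Macro averaging: every class contributes equally, regardless of its size.
    return sum(per_class) / len(per_class)


# Toy data (hypothetical): four statements, one of which is misclassified.
gold = ["Entailment", "Entailment", "Contradiction", "Contradiction"]
pred = ["Entailment", "Contradiction", "Contradiction", "Contradiction"]
score = macro_f1(gold, pred)
print(f"macro-F1 = {score:.4f}")
```

In this toy case the Entailment class has precision 1 and recall 0.5 (F1 of about 0.667), the Contradiction class has precision 2/3 and recall 1 (F1 of 0.8), and their unweighted mean is about 0.733. Because each class counts equally, macro-F1 penalizes a model that does well on only one of the two labels.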

Lisbon Computational Linguists at SemEval-2024 Task 2: Using A Mistral 7B Model and Data Augmentation

SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials

BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains

SEME at SemEval-2024 Task 2: Comparing Masked and Generative Language Models on Natural Language Inference for Clinical Trials

DFKI-NLP at SemEval-2024 Task 2: Towards Robust LLMs Using Data Perturbations and MinMax Training

Adapting LLMs for the Medical Domain in Portuguese: A Study on Fine-Tuning and Model Evaluation

A Framework to Assess Clinical Safety and Hallucination Rates of LLMs for Medical Text Summarisation

BioMistral-NLU: Towards More Generalizable Medical Language Understanding through Instruction Tuning

NLI4CT: Multi-Evidence Natural Language Inference for Clinical Trial Reports

SaulLM-7B: A pioneering Large Language Model for Law

MediAlbertina: An European Portuguese medical language model

SLIM-RAFT: A Novel Fine-Tuning Approach to Improve Cross-Linguistic Performance for Mercosur Common Nomenclature

Edinburgh Clinical NLP at SemEval-2024 Task 2: Fine-tune your model unless you have access to GPT-4

D-NLP at SemEval-2024 Task 2: Evaluating Clinical Inference Capabilities of Large Language Models

MedDoc-Bot: A Chat Tool for Comparative Analysis of Large Language Models in the Context of the Pediatric Hypertension Guideline

Edinburgh Clinical NLP at MEDIQA-CORR 2024: Guiding Large Language Models with Hints

Harmonising the Clinical Melody: Tuning Large Language Models for Hospital Course Summarisation in Clinical Coding

Zero-Shot LLMs for Named Entity Recognition: Targeting Cardiac Function Indicators in German Clinical Texts

Assessing The Potential Of Mid-Sized Language Models For Clinical QA

Large Language Models for Biomedical Text Simplification: Promising But Not There Yet

Towards Evaluating and Building Versatile Large Language Models for Medicine