Abstract: How can we interpret and retrieve medical evidence to support clinical decisions? Clinical trial reports (CTRs) amassed over the years contain indispensable information for the development of personalized medicine. However, it is practically infeasible to manually inspect more than 400,000 clinical trial reports in order to find the best evidence for experimental treatments. Natural Language Inference (NLI) offers a potential solution to this problem by allowing the scalable computation of textual entailment. However, existing NLI models perform poorly on biomedical corpora, and previously published datasets fail to capture the full complexity of inference over CTRs. In this work, we present a novel resource to advance research on NLI for reasoning on CTRs. The resource includes two main tasks: first, to determine the inference relation between a natural language statement and a CTR; second, to retrieve supporting facts to justify the predicted relation. We provide NLI4CT, a corpus of 2400 statements and CTRs, annotated for these tasks. Baselines on this corpus expose the limitations of existing NLI models, with 6 state-of-the-art NLI models achieving a maximum F1 score of 0.627. To the best of our knowledge, we are the first to design a task that covers the interpretation of full CTRs. To encourage further work on this challenging dataset, we make the corpus, competition leaderboard, website and code to replicate the baseline experiments available at: https://github.com/ai-systems/nli4ct

What problem does this paper attempt to address?

The problem this paper addresses is natural language inference (NLI) over clinical trial reports (CTRs): how to interpret and retrieve medical evidence efficiently to support clinical decision-making. Specifically, the paper proposes two main tasks:

1. **Textual Entailment Task**: Determine the inferential relationship between a natural language statement and a clinical trial report. The model must understand the content of the report and judge whether the statement follows from it, that is, whether the relationship between the two is entailment or contradiction.
2. **Evidence Extraction Task**: Extract the facts from the clinical trial report that support the predicted relationship. The model must not only judge the relationship between the statement and the report, but also identify the specific parts of the report that justify this judgment.

### Background and Motivation

- **Importance of Clinical Trial Reports**: Clinical trial reports have accumulated a large amount of information about the effectiveness and safety of new treatments, and this information is crucial for the development of personalized medicine.
- **Limitations of Manual Review**: It is impractical to manually review more than 400,000 clinical trial reports to find the best evidence.
- **Deficiencies of Existing NLI Models**: Existing NLI models perform poorly on biomedical corpora, and existing datasets do not fully capture the complexity of reasoning over clinical trial reports.

### Main Contributions of the Paper

1. **Defined a New Benchmark** (NLI4CT), comprising two main tasks that cover multiple fundamental challenges faced by modern NLI systems.
2. **Released a New Public Corpus**, which contains 2,400 expert-annotated entailment relationships, together with the related clinical trial reports, labels, and lists of extracted evidence.
3. **Conducted an Extensive Empirical Evaluation of Existing NLI Models**, demonstrating the limitations of current models on the proposed tasks. 7 representative NLI models were tested, and the highest F1 score was 0.644.

### Main Challenges

- **Biomedical Reasoning**: Handling abbreviations, synonyms, taxonomic relationships, and domain knowledge.
- **Common-sense Reasoning**: Coreference resolution and general world knowledge.
- **Numerical Reasoning**: Comparing doses, frequencies, and percentages, which usually requires unit conversion.

### Experimental Results

- **Performance of Baseline Models**: In task 1, the best-performing models are BioBERT and BioMegatron, with F1 scores reaching 0.644. However, these models overfit severely to the training set.
- **Evidence-Only Baseline**: A baseline that uses only the gold evidence as the premise does not significantly improve performance, indicating that even after 10 epochs of training the model does not effectively draw conclusions from the relevant evidence.
- **Statistical Artifacts**: A statement-only baseline, with no access to the clinical trial reports, shows that the models rely on superficial statistical artifacts rather than learning the underlying rules of the task.

### Conclusion

This paper provides a new benchmark for natural language inference research in the biomedical field by proposing the NLI4CT dataset and its two tasks. Although existing NLI models show some capability, they still face many challenges, especially on complex biomedical and numerical reasoning. Future research needs to improve the models further to meet these challenges.
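To make the two tasks and the F1 figures more concrete, here is a minimal Python sketch of how an instance and the two scores could be represented. It is an illustration under stated assumptions, not the benchmark's official scorer: the `Instance` class, its field names, the label strings, and the choice of "Entailment" as the positive class are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Instance:
    """One statement paired with CTR text (all field names are hypothetical)."""
    statement: str
    ctr_lines: List[str]   # premise: lines of the clinical trial report
    label: str             # assumed label strings: "Entailment" / "Contradiction"
    evidence: Set[int] = field(default_factory=set)  # gold indices into ctr_lines


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def entailment_f1(gold: List[str], pred: List[str],
                  positive: str = "Entailment") -> float:
    """Task 1: binary F1, treating `positive` as the positive class (assumption)."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    return f1(tp, fp, fn)


def evidence_f1(gold: Set[int], pred: Set[int]) -> float:
    """Task 2: F1 over the CTR line indices selected as evidence for one instance."""
    return f1(len(gold & pred), len(pred - gold), len(gold - pred))
```

For example, `entailment_f1(["Entailment", "Contradiction"], ["Entailment", "Entailment"])` counts one true positive and one false positive, and `evidence_f1({0, 2, 3}, {2, 3, 5})` rewards the two overlapping evidence lines while penalizing one missed and one spurious line.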

NLI4CT: Multi-Evidence Natural Language Inference for Clinical Trial Reports
