A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support Language Processing for Medical Literature

Benjamin Nye, Junyi Jessy Li, Roma Patel, Yinfei Yang, Iain J. Marshall, Ani Nenkova, Byron C. Wallace

DOI: https://doi.org/10.48550/arXiv.1806.04185

2018-06-12

Abstract: We present a corpus of 5,000 richly annotated abstracts of medical articles describing clinical randomized controlled trials. Annotations include demarcations of text spans that describe the Patient population enrolled, the Interventions studied and to what they were Compared, and the Outcomes measured (the 'PICO' elements). These spans are further annotated at a more granular level, e.g., individual interventions within them are marked and mapped onto a structured medical vocabulary. We acquired annotations from a diverse set of workers with varying levels of expertise and cost. We describe our data collection process and the corpus itself in detail. We then outline a set of challenging NLP tasks that would aid searching of the medical literature and the practice of evidence-based medicine.

Computation and Language

What problem does this paper attempt to address?

This paper addresses the automatic processing of, and information extraction from, reports of randomized controlled trials (RCTs) in the medical literature. The authors constructed a corpus of 5,000 abstracts of medical articles describing RCTs. The annotations mark the text spans that describe the patient population (Patient), the interventions studied (Intervention), what they were compared to (Comparator), and the outcomes measured (Outcome), together called the "PICO" elements. These spans are further annotated at a finer-grained level; for example, individual interventions within them are marked and mapped onto a structured medical vocabulary. The main contributions of the paper are as follows:

1. **Dataset construction**: It provides a large set of annotated abstracts from the medical literature, centered on PICO elements, to support the application of natural language processing (NLP) techniques to medical literature.
2. **Annotation strategy**: It adopts a hybrid crowdsourcing strategy, using annotators with different levels of expertise and cost, from non-professionals to medical doctors, to obtain high-quality annotations.
3. **NLP task definition**: It proposes several challenging NLP tasks that directly support the practice of evidence-based medicine (EBM): identifying the text spans that describe PICO elements in abstracts, extracting structured information from abstracts, and identifying redundant mentions of the same PICO element.
4. **Baseline model**: It provides baseline models and corresponding experimental results for these tasks, including a neural tagging model (LSTM-CRF) that combines long short-term memory networks (LSTM) with conditional random fields (CRF) to identify the text spans of PICO elements.

Through this work, the paper aims to accelerate the synthesis of biomedical evidence, improve the efficiency of searching and organizing the medical literature, and thus promote the development of evidence-based medicine.
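To make the span-identification task concrete, the following is a minimal sketch of how span-level PICO annotations could be turned into per-token tags of the kind a sequence tagger such as an LSTM-CRF is trained on. It is an illustration under stated assumptions, not the corpus's actual file format or the authors' preprocessing: the `Span` class, the `spans_to_bio` function, the label names `POP`, `INT` and `OUT`, and the example sentence are all hypothetical, and the sketch assumes spans do not overlap.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    """A hypothetical annotated span over token indices [start, end)."""
    start: int   # index of the first token in the span
    end: int     # index one past the last token in the span
    label: str   # assumed label names: "POP", "INT", "OUT"


def spans_to_bio(num_tokens: int, spans: List[Span]) -> List[str]:
    """Convert spans into BIO tags, one tag per token.

    Tokens outside every span get "O". Spans are assumed not to overlap;
    if they do, the span that starts later overwrites the earlier one.
    """
    tags = ["O"] * num_tokens
    for span in sorted(spans, key=lambda s: s.start):
        if not (0 <= span.start < span.end <= num_tokens):
            raise ValueError(f"span out of range: {span}")
        for i in range(span.start, span.end):
            prefix = "B-" if i == span.start else "I-"
            tags[i] = prefix + span.label
    return tags


# Invented example sentence, pre-tokenized on whitespace.
tokens = ("Patients with asthma received inhaled budesonide or placebo ; "
          "lung function was measured").split()
spans = [
    Span(0, 3, "POP"),   # "Patients with asthma"
    Span(4, 6, "INT"),   # "inhaled budesonide"
    Span(7, 8, "INT"),   # "placebo" (the comparator, tagged as an intervention here)
    Span(9, 11, "OUT"),  # "lung function"
]
tags = spans_to_bio(len(tokens), spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

The BIO scheme is a common choice for this kind of tagging because the `B-` prefix keeps adjacent spans of the same type distinguishable; whether the paper's baselines use exactly this encoding is not stated in the summary above.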

Towards Constructing a Corpus for Studying the Effects of Treatments and Substances Reported in PubMed Abstracts

A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine

An annotated corpus of clinical trial publications supporting schema-based relational information extraction

Building a comprehensive syntactic and semantic corpus of Chinese clinical texts

Characterizing pituitary adenomas in clinical notes: Corpus construction and its application in LLMs

Inferring Which Medical Treatments Work from Reports of Clinical Trials

A unified framework of medical information annotation and extraction for Chinese clinical text

A Corpus for Detecting High-Context Medical Conditions in Intensive Care Patient Notes Focusing on Frequently Readmitted Patients

SemClinBr -- a multi institutional and multi specialty semantically annotated corpus for Portuguese clinical NLP tasks

CORAL: Expert-Curated medical Oncology Reports to Advance Language Model Inference

The Medical Scribe: Corpus Development and Model Performance Analyses

Building a Pediatric Medical Corpus: Word Segmentation and Named Entity Annotation

Named Entities in Medical Case Reports: Corpus and Experiments

A novel relay selection scheme for LTE-advanced system under delay and load constraints

Unified Medical Language System term occurrences in clinical notes: a large-scale corpus analysis

Construction, evaluation, and application of an electronic medical record corpus for cerebral palsy rehabilitation

Large Language Models for Social Determinants of Health Information Extraction from Clinical Notes - A Generalizable Approach across Institutions

Building an OMOP common data model-compliant annotated corpus for COVID-19 clinical trials

GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines

Cross-lingual Natural Language Processing on Limited Annotated Case/Radiology Reports in English and Japanese: Insights from the Real-MedNLP Workshop