Abstract: Many studies require researchers to extract specific information from the published literature, such as details about sequence records or about a randomized controlled trial. While manual extraction is cost-efficient for small studies, it becomes much more costly and time-consuming for larger studies such as systematic reviews. To avoid exhaustive manual searches and extraction, and their related cost and effort, natural language processing (NLP) methods can be tailored for the more subtle extraction and decision tasks that typically only humans have performed. The need for studies that use the published literature as a data source became even more evident as the COVID-19 pandemic raged through the world. Millions of sequenced samples were deposited in public repositories such as GISAID and GenBank, promising large genomic epidemiology studies, but more often than not the records lacked important details, and their absence prevented large-scale studies. In particular, granular geographic location and even the most basic patient-relevant data, such as demographic information or clinical outcomes, were not noted in the sequence record. However, some of these data were indeed published, but in the text, tables, or supplementary material of a corresponding article. We present here methods to identify relevant journal articles that report having produced new SARS-CoV-2 sequences and made them available in GenBank or GISAID, as the articles that initially produced and made available the sequences are the most likely to include the high-level details about the patients from whom the sequences were obtained. Human annotators validated the approach, creating a gold standard set for training and validation of a machine learning classifier.
Identifying these articles is a crucial step to enable future automated informatics pipelines that will apply machine learning and natural language processing to identify patient characteristics such as co-morbidities, outcomes, age, gender, and race, enriching SARS-CoV-2 sequence databases with actionable information for defining large genomic epidemiology studies. Patient metadata enriched in this way can enable secondary data analysis, at scale, to uncover associations between the viral genome (including variants of concern and their sublineages), transmission risk, and health outcomes. However, for such enrichment to happen, the right papers need to be found and very detailed data need to be extracted from them. Further, finding the very specific articles needed for inclusion is a task that also facilitates scoping and systematic reviews, greatly reducing the time needed for full-text analysis and extraction.

Text mining biomedical literature to identify extremely unbalanced data for digital epidemiology and systematic reviews: dataset and methods for a SARS-CoV-2 genomic epidemiology study