SeqScreen: Accurate and Sensitive Functional Screening of Pathogenic Sequences via Ensemble Learning

Advait Balaji, Bryce Kille, Anthony D. Kappell, Gene D. Godbold, Madeline Diep, R. A. Leo Elworth, Zhiqin Qian, Dreycey Albin, Daniel J. Nasko, Nidhi Shah, Mihai Pop, Santiago Segarra, Krista L. Ternus, Todd J. Treangen
DOI: https://doi.org/10.1101/2021.05.02.442344
2021-05-02
Abstract: The COVID-19 pandemic has emphasized the importance of detecting known and emerging pathogens from clinical and environmental samples. However, robust characterization of pathogenic sequences remains an open challenge. To this end, we developed SeqScreen, which can accurately characterize short nucleotide sequences using taxonomic and functional labels, and a customized set of curated Functions of Sequences of Concern (FunSoCs) specific to microbial pathogenesis. We show our ensemble machine learning model can label protein-coding sequences with FunSoCs with high recall and precision. SeqScreen is a step towards a novel paradigm of functionally informed pathogen characterization and is available for download at: www.gitlab.com/treangenlab/seqscreen