ASAP-SML: An antibody sequence analysis pipeline using statistical testing and machine learning

Xinmeng Li, James A. Van Deventer, Soha Hassoun
DOI: https://doi.org/10.1371/journal.pcbi.1007779
2020-04-27
PLoS Computational Biology
Abstract: Antibodies are capable of potently and specifically binding individual antigens and, in some cases, disrupting their functions. The key challenge in generating antibody-based inhibitors is the lack of fundamental information relating sequences of antibodies to their unique properties as inhibitors. We develop a pipeline, Antibody Sequence Analysis Pipeline using Statistical testing and Machine Learning (ASAP-SML), to identify features that distinguish one set of antibody sequences from antibody sequences in a reference set. The pipeline extracts feature fingerprints from sequences. The fingerprints represent germline, CDR canonical structure, isoelectric point and frequent positional motifs. Machine learning and statistical significance testing techniques are applied to antibody sequences and extracted feature fingerprints to identify distinguishing feature values and combinations thereof. To demonstrate how it works, we applied the pipeline to sets of antibody sequences known to bind or inhibit the activities of matrix metalloproteinases (MMPs), a family of zinc-dependent enzymes that promote cancer progression and undesired inflammation under pathological conditions, against reference datasets that do not bind or inhibit MMPs. ASAP-SML identifies features and combinations of feature values found in the MMP-targeting sets that are distinct from those in the reference sets.
The availability of machine learning techniques and the exponential growth of sequencing data present new opportunities to identify features that endow antibodies with the ability to disrupt the functions of biological targets. We have created a pipeline that uses statistical testing and machine learning techniques to determine features that are overrepresented in a specified set of antibody sequences in comparison to a reference set. The pipeline is referred to as Antibody Sequence Analysis Pipeline using Statistical testing and Machine Learning (ASAP-SML).
We demonstrate the use of ASAP-SML by analyzing sets of antibodies that inhibit matrix metalloproteinases (MMPs) against reference sets. ASAP-SML performs within-set and across-set similarity analysis. As in prior studies, our analysis of these datasets shows that features associated with the antibody heavy chain are more likely to differentiate MMP-targeting antibody sequences from reference antibody sequences. Further, ASAP-SML identifies several features in the MMP-targeting set that are distinct from the reference sets. Using design recommendation trees, ASAP-SML suggests combinations of features that can be included or excluded to augment the targeting set with additional candidate MMP-targeting antibody sequences.
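The idea of testing whether a feature is overrepresented in a targeting set relative to a reference set can be illustrated with a minimal sketch. The text above does not name the statistical test, so the one-sided Fisher's exact test used here is an assumption, and the function name, feature labels, and toy fingerprints are all hypothetical rather than taken from ASAP-SML.

```python
from math import comb


def overrepresentation_p_value(target_with, target_total, ref_with, ref_total):
    """One-sided Fisher's exact test (an assumed choice of test).

    Returns the probability of seeing at least `target_with` feature-positive
    sequences in the targeting set if the feature were distributed at random
    across the targeting and reference sets (hypergeometric upper tail).
    """
    total = target_total + ref_total
    with_feature = target_with + ref_with
    upper = min(with_feature, target_total)
    tail = sum(
        comb(with_feature, k) * comb(total - with_feature, target_total - k)
        for k in range(target_with, upper + 1)
    )
    return tail / comb(total, target_total)


def count_with(feature, fingerprints):
    """Count the sequences whose fingerprint contains the given feature."""
    return sum(feature in fingerprint for fingerprint in fingerprints)


# Toy fingerprints: each sequence is reduced to a set of feature labels.
# The labels below are placeholders, not features reported in the paper.
target_set = [{"motif_A", "pI_high"}, {"motif_A"}, {"motif_A", "pI_high"}]
reference_set = [{"pI_high"}, set(), {"pI_high"}]

for feature in ("motif_A", "pI_high"):
    p = overrepresentation_p_value(
        count_with(feature, target_set), len(target_set),
        count_with(feature, reference_set), len(reference_set),
    )
    print(feature, round(p, 3))
```

In this toy data, the label found only in the targeting set receives the smaller p-value, while the label that is equally common in both sets does not stand out. With many features tested at once, a multiple-testing correction would normally be applied on top of such p-values.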
biochemical research methods, mathematical & computational biology