GenomicSign: A computational method to discover unique, specific, and amplifiable signatures of target genomic sequences

Prasanna Kumar S, Ashok Palaniappan
DOI: https://doi.org/10.1101/2024.11.05.622192
2024-11-07
Abstract: Molecular diagnostics for the rapid identification of infectious, virulent, and pathogenic organisms are key to health and global security. Such methods rely on the identification and detection of signatures possessed by the organism. In this work, we outline a computational algorithm, GenomicSign, to determine unique and amplifiable genomic signatures of a set of target sequences against a background set of non-target sequences. The set of target sequences might comprise variants of a pathogen of interest, say the SARS-CoV-2 virus. Unique k-mers of the consensus target sequence are determined for a range of k-values, and the threshold k-value yielding a sharp transition in the number of unique k-mers is identified as k_opt. The corresponding unique k-mers for k ≥ k_opt are compared against the set of non-target sequences to identify k-mers that are unique to the target. A pair of such k-mers in close proximity could enclose a potential amplicon. Primers to such pairs are designed and scored using a custom scheme to rank the potential amplicons. The top-ranked amplicons are candidates for unique and amplifiable genomic signatures. The entire workflow is demonstrated using a case study that distinguishes the SARS-CoV-2 omicron target strain from other, non-target SARS-CoV-2 variants. GenomicSign has been implemented in Python and is available as open-source software under the MIT Licence (https://www.github.com/apalania/GenomicSign).
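As a rough illustration of the k-value scan described in the abstract, the sketch below counts unique k-mers of a consensus sequence over a range of k and picks a threshold. It rests on two assumptions not confirmed by the source: "unique" is taken to mean a k-mer that occurs exactly once in the consensus, and the "sharp transition" is approximated by the largest jump in the unique count between consecutive k-values. The function names (`unique_kmers`, `unique_counts`, `pick_k_opt`) are hypothetical and are not taken from the GenomicSign code base.

```python
from collections import Counter


def unique_kmers(seq: str, k: int) -> set[str]:
    """Return the k-mers that occur exactly once in seq (assumed meaning of 'unique')."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer for kmer, n in counts.items() if n == 1}


def unique_counts(seq: str, k_values) -> dict[int, int]:
    """Map each k to the number of unique k-mers of seq."""
    return {k: len(unique_kmers(seq, k)) for k in k_values}


def pick_k_opt(counts: dict[int, int]) -> int:
    """Hypothetical threshold rule: the k with the largest increase over the previous k.

    The actual criterion used by GenomicSign may differ.
    """
    ks = sorted(counts)
    jumps = {k: counts[k] - counts[prev] for prev, k in zip(ks, ks[1:])}
    return max(jumps, key=jumps.get)


if __name__ == "__main__":
    consensus = "ACGTACGTTGCAACGT"  # toy sequence, not real genome data
    counts = unique_counts(consensus, range(2, 7))
    print(counts, "k_opt =", pick_k_opt(counts))
```

A single pass with `Counter` keeps each k-value linear in the sequence length, which is adequate for a viral genome of a few tens of kilobases; larger genomes would likely call for a dedicated k-mer counter.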
Bioinformatics
What problem does this paper attempt to address?
This paper addresses how to quickly identify the genomic features that distinguish a specific pathogen or virus variant. The authors propose a computational method, GenomicSign, that discovers unique, specific, and amplifiable genomic signatures in a set of target genome sequences, so as to distinguish it from a set of non-target genome sequences. For example, the method can be used to identify the differences between a specific variant of SARS-CoV-2 (the virus that causes COVID-19), such as the Omicron variant, and other SARS-CoV-2 variants.

### Main problems to be solved:

1. **Rapid identification**: Develop a fast and effective algorithm that can identify the unique genomic characteristics of specific pathogens or virus variants in a large amount of genomic data.
2. **Specificity**: Ensure that the identified genomic characteristics are specific to the target pathogen or virus variant, rather than widely present in other related species or variants.
3. **Amplifiability**: The identified genomic characteristics should amplify well by PCR so that they can be used in molecular diagnosis.

### Method overview:

- **Determine the optimal k-value**: By analyzing the distribution of k-mers (fragments of k consecutive nucleotides) of different lengths in the target genome, find an optimal k-value (denoted as \( k_{\text{opt}} \)) at which the number of unique k-mers in the target genome shows a sharp transition.
- **Screen specific k-mers**: Compare the unique k-mers with \( k\geq k_{\text{opt}} \) against the non-target genomes, and retain only the k-mers that exist in the target genome alone.
- **Design primer pairs**: Among the retained specific k-mers, find pairs of k-mers that lie close to each other and could enclose a potential PCR amplification region (amplicon). Design the corresponding primer pairs, and score and rank them according to a series of criteria (such as melting temperature and GC content).
- **Verification and application**: Verify the effectiveness of the designed primer pairs through experiments, and apply them to actual molecular diagnosis.

### Case study:

The paper demonstrates GenomicSign through a case study that identifies the unique genomic characteristics of the SARS-CoV-2 Omicron variant compared with other SARS-CoV-2 variants. GenomicSign identified a potential amplification region of 81 base pairs, located in the coding region of the ORF1ab gene.

### Conclusion:

The GenomicSign algorithm provides an effective method for identifying the unique genomic characteristics of specific pathogens or virus variants, and these characteristics can be used for molecular diagnosis and environmental monitoring. Future work will further verify the reliability and practicality of this method in clinical applications.
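The screening, pairing, and scoring steps of the method overview can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the GenomicSign implementation: specificity is tested by exact substring match on the forward strand only, the amplicon length window is arbitrary, and the score is a hypothetical stand-in for the paper's custom scheme, which the summary does not spell out. Melting temperature is estimated with the Wallace rule of thumb (2 °C per A/T, 4 °C per G/C), which is only reasonable for short oligonucleotides. All function names are hypothetical.

```python
def target_specific(kmers, non_targets):
    """Keep k-mers that are absent from every non-target sequence (forward strand only)."""
    return {k for k in kmers if not any(k in nt for nt in non_targets)}


def candidate_amplicons(target, kmers, min_len=60, max_len=200):
    """Pair non-overlapping k-mers whose span on the target is a plausible amplicon.

    Returns (left_kmer, right_kmer, amplicon_length) tuples. The length window
    is an assumption chosen for illustration.
    """
    sites = sorted((target.find(k), k) for k in kmers if k in target)
    pairs = []
    for i, (left_pos, left) in enumerate(sites):
        for right_pos, right in sites[i + 1:]:
            if right_pos - left_pos > max_len:
                break  # sites are sorted by position, so later ones are farther away
            length = right_pos + len(right) - left_pos
            no_overlap = right_pos >= left_pos + len(left)
            if no_overlap and min_len <= length <= max_len:
                pairs.append((left, right, length))
    return pairs


def gc_fraction(seq):
    """Fraction of G and C bases in seq."""
    return sum(base in "GC" for base in seq) / len(seq)


def wallace_tm(seq):
    """Wallace-rule melting temperature estimate in degrees Celsius."""
    gc = sum(base in "GC" for base in seq)
    return 2 * (len(seq) - gc) + 4 * gc


def pair_score(left, right):
    """Hypothetical penalty (lower is better): GC far from 50% and mismatched Tm.

    The reverse primer would be the reverse complement of `right`; GC content
    and the Wallace estimate are unchanged by reverse complementation.
    """
    gc_penalty = abs(gc_fraction(left) - 0.5) + abs(gc_fraction(right) - 0.5)
    tm_penalty = abs(wallace_tm(left) - wallace_tm(right)) / 10
    return gc_penalty + tm_penalty


def rank_amplicons(target, kmers, non_targets, **window):
    """Screen, pair, and sort candidate amplicons from best to worst score."""
    specific = target_specific(kmers, non_targets)
    pairs = candidate_amplicons(target, specific, **window)
    return sorted(pairs, key=lambda p: pair_score(p[0], p[1]))
```

A practical version would also screen the reverse complement of each k-mer against the non-target set and would replace the Wallace estimate with a nearest-neighbour melting temperature model.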