Modeling Primer-Template Interactions using BERT Tokenizer to Predict PCR Amplification with Attention-BiLSTM

Niloofar Latifian, Naghmeh Nazer, Amir Masoud Jafarpisheh, Babak Hossein Khalaj
DOI: https://doi.org/10.1101/2024.11.23.624986
2024-11-24
Abstract: Polymerase Chain Reaction (PCR) is a widely used molecular biology technique to amplify DNA sequences. PCR amplification is affected by factors such as binding dynamics and primer-template interactions. This study aims to reduce the time and cost of the experiment by predicting PCR outcomes based on these factors. To achieve this, we first identify the most stable binding sites for each primer-template pair by calculating the Gibbs free energy. Then, we propose a unique labelling strategy that captures primer-template interactions in the binding sites by analyzing match and mismatch positions. We categorize a set of English words into two semantically distinct groups: one for match positions and another for mismatch positions. Words within each group have a higher cosine similarity to one another than to words in the opposing group. We assign the corresponding word to each base pair based on whether it is a match or a mismatch. The labelled sequence is then tokenized with BERT, serving as input to an attention Bi-LSTM model. Achieving 96.3% accuracy, this approach significantly outperforms prior methods and pioneers BERT-based analysis in primer-template bindings.
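The labelling step described in the abstract can be illustrated with a minimal sketch. It assumes the primer and the template binding site are already aligned position by position, and that a match means a Watson-Crick pair. The word lists and the function name `label_binding_site` are hypothetical placeholders; the abstract does not list the actual vocabulary or say how words are assigned within a group.

```python
# Watson-Crick complements for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Hypothetical vocabularies (assumption): one semantically "positive" group for
# matches and one "negative" group for mismatches, keyed by the primer base.
MATCH_WORDS = {"A": "good", "T": "great", "C": "fine", "G": "nice"}
MISMATCH_WORDS = {"A": "bad", "T": "awful", "C": "poor", "G": "wrong"}


def label_binding_site(primer: str, template_site: str) -> str:
    """Turn an aligned primer/template pair into a space-separated word sequence.

    Position i of `primer` is assumed to face position i of `template_site`.
    """
    primer, template_site = primer.upper(), template_site.upper()
    if len(primer) != len(template_site):
        raise ValueError("primer and template site must have equal length")

    words = []
    for p, t in zip(primer, template_site):
        if p not in COMPLEMENT or t not in COMPLEMENT:
            raise ValueError(f"unsupported base pair: {p}/{t}")
        # A position is a match when the template base complements the primer base.
        vocab = MATCH_WORDS if COMPLEMENT[p] == t else MISMATCH_WORDS
        words.append(vocab[p])
    return " ".join(words)


if __name__ == "__main__":
    # Three Watson-Crick pairs followed by a T/T mismatch.
    print(label_binding_site("ACGT", "TGCT"))  # good fine nice awful
```

The resulting word sequence is plain English text, so it could be passed to a pretrained BERT tokenizer to obtain token IDs for a downstream sequence model such as the attention Bi-LSTM the abstract mentions.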
Biology