H-VAE: A Hybrid Variational AutoEncoder with Data Augmentation in Predicting CRISPR/Cas9 Off-target

Weiming Xiang,Dong Chen,Yingbo Cui,Shaoliang Peng
DOI: https://doi.org/10.1109/bibm52615.2021.9669570
2021-01-01
Abstract: CRISPR/Cas9-based gene editing technology has been widely used in various cells and organisms. However, off-target effects can have unpredictable consequences for the edited organism. One of the main obstacles to predicting CRISPR/Cas9 off-target effects is the imbalance between the numbers of positive and negative samples, which poses a challenge for training traditional deep learning algorithms. In this paper, we propose H-VAE, a hybrid variational autoencoder model with data augmentation. The model extracts richer sgRNA-DNA base pair matching information and reduces the risk of overfitting. It also alleviates the data imbalance problem by using the underlying information of the training samples, as extracted by the VAE. Because existing models are weak at extracting base pair matching information, we propose a different encoding scheme based on pair encoding, which enables the model to make full use of sgRNA-DNA base pair matching information. On the Mismatch data set, compared with DeepCRISPR, the ROC-AUC and PR-AUC increased by 0.6% and 41.9%, respectively. In the new Indels data set test scenario, compared with CRISPR-Net, the ROC-AUC and PR-AUC increased by 1.5% and 133.4%, respectively. These results indicate that H-VAE can improve off-target prediction in various scenarios, and the improvement in PR-AUC shows that H-VAE can significantly improve classification on imbalanced data. The experimental results demonstrate that H-VAE performs better than state-of-the-art CRISPR/Cas9 off-target prediction methods on various types of data sets. The code and data can be obtained at https://github.com/weimingxiang/H-VAE.
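The abstract does not spell out the pair encoding, so the following is only a minimal sketch of what an encoding "based on pair encoding" might look like, not the authors' implementation. It assumes each aligned position is represented by a one-hot vector over all (sgRNA base, DNA base) combinations, that sgRNA is written with the DNA alphabet (T rather than U), and that `-` is a gap symbol for indels. The names `ALPHABET`, `PAIR_INDEX`, `pair_encode`, and `mismatch_positions` are hypothetical.

```python
from itertools import product

# Assumption: four bases plus '-' as a hypothetical gap symbol for indels.
ALPHABET = "ACGT-"

# Map every ordered (sgRNA base, DNA base) pair to a channel index (25 channels).
PAIR_INDEX = {pair: i for i, pair in enumerate(product(ALPHABET, repeat=2))}


def pair_encode(sgrna, dna):
    """One-hot encode aligned sgRNA/DNA sequences, one pair per position.

    Returns a list of rows; each row has len(PAIR_INDEX) entries with a
    single 1 marking which (sgRNA base, DNA base) pair occurs there.
    """
    if len(sgrna) != len(dna):
        raise ValueError("sequences must be aligned to the same length")
    encoded = []
    for g, d in zip(sgrna.upper(), dna.upper()):
        row = [0] * len(PAIR_INDEX)
        row[PAIR_INDEX[(g, d)]] = 1  # KeyError on symbols outside ALPHABET
        encoded.append(row)
    return encoded


def mismatch_positions(sgrna, dna):
    """Return the 0-based positions where the two aligned sequences differ."""
    pairs = zip(sgrna.upper(), dna.upper())
    return [i for i, (g, d) in enumerate(pairs) if g != d]
```

Encoding the pair jointly, rather than encoding each strand separately, keeps the identity of every match or mismatch (for example A paired with G versus A paired with C) explicit in the input, which is one plausible way for a model to "make full use of" base pair matching information.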