Abstract:
Background: N4-methylcytosine (4mC) is one of the most widespread DNA methylation modifications. It plays an important role in DNA replication and repair, epigenetic inheritance, gene expression levels and regulation of transcription. Although biological experiments can identify potential 4mC modification sites, they are labor intensive and limited by the experimental environment. Therefore, it is crucial to construct a computational model to identify 4mC sites.
Objective: Although some computational methods have been proposed to identify 4mC sites, several problems should not be ignored: (1) a large number of unknown nucleotides exist in biological sequences; (2) previous encoding techniques produce a large number of zeros; (3) sequence distribution information is important for identifying 4mC sites. Considering these aspects, we propose a computational model based on a novel encoding strategy with position-specific information to identify 4mC sites.
Methods: We constructed a computational model, i4mC-CPXG, based on extreme gradient boosting. Feature vectors are extracted from two aspects: nucleotide information and position-specific information. For nucleotide information, we used prior information to identify the base type of each unknown nucleotide and to reduce the influence of invalid information caused by large numbers of zeros. For position-specific information, the vector was designed to express the distribution and arrangement of bases. The feature vector fusing nucleotide information and position-specific information was then input into extreme gradient boosting to construct the model.
Results: The accuracy of i4mC-CPXG is 82.49% on the independent dataset. This result was better than that of i4mC-w2vec, which had been the best model on the imbalanced dataset with a ratio of 1:15. Our model also achieved good performance on other species. These results validate the effectiveness of i4mC-CPXG.
Conclusion: Our method is effective at identifying potential 4mC modification sites because of the proposed encoding strategy, which fuses in position-specific information. The satisfactory prediction results on balanced datasets, imbalanced datasets and datasets from other species indicate that i4mC-CPXG can provide a reasonable supplement for biological research.
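The abstract does not give the exact encoding, so the following Python sketch only illustrates the general idea described in the Methods: resolve unknown nucleotides from a prior, use a dense nucleotide code that contains no zeros, and append a position-specific score. Every name here (`position_base_frequencies`, `resolve_unknown`, `encode`), the choice of per-position base frequencies as the prior, and the positive-minus-negative frequency score are assumptions made for illustration, not the i4mC-CPXG implementation.

```python
from collections import Counter

BASES = "ACGT"

def position_base_frequencies(seqs):
    """Per-position frequency of each base, ignoring unknown symbols such as 'N'."""
    freqs = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs if s[i] in BASES)
        total = sum(counts.values()) or 1  # avoid division by zero
        freqs.append({b: counts[b] / total for b in BASES})
    return freqs

def resolve_unknown(seq, prior):
    """Replace each non-ACGT symbol with the most frequent base at that position
    (an assumed form of 'prior information'; ties fall back to BASES order)."""
    return "".join(
        c if c in BASES else max(BASES, key=lambda b: prior[i][b])
        for i, c in enumerate(seq)
    )

def encode(seq, prior, pos_freqs, neg_freqs):
    """Fuse a nucleotide code with a position-specific score (hypothetical scheme)."""
    seq = resolve_unknown(seq, prior)
    # Nucleotide part: integer codes 1..4, so the vector has no zero padding.
    nucleotide_part = [BASES.index(c) + 1 for c in seq]
    # Position part: how much more often the base occurs at this position
    # in positive samples than in negative samples.
    position_part = [pos_freqs[i][c] - neg_freqs[i][c] for i, c in enumerate(seq)]
    return nucleotide_part + position_part

# Toy data, invented for this example only.
positives = ["ACGT", "ACGA", "ANGT"]
negatives = ["TTTT", "TGCA"]
prior = position_base_frequencies(positives + negatives)
pos_freqs = position_base_frequencies(positives)
neg_freqs = position_base_frequencies(negatives)
vector = encode("ANGT", prior, pos_freqs, neg_freqs)
```

In a full pipeline, vectors like `vector` would be stacked into a feature matrix and passed to a gradient boosting classifier, for example `XGBClassifier` from the `xgboost` package.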

i4mC-CPXG: A Computational Model for Identifying DNA N4-Methylcytosine Sites in Rosaceae Genome Using Novel Encoding Strategy

An Effective Algorithm Based on Sequence and Property Information for N4-methylcytosine Identification in Multiple Species

Identifying DNA N4-methylcytosine Sites in the Rosaceae Genome with a Deep Learning Model Relying on Distributed Feature Representation

DNA4mC-LIP: a Linear Integration Method to Identify N4-methylcytosine Site in Multiple Species

4Mcpred: Machine Learning Methods for DNA N4-methylcytosine Sites Prediction.

Using a hybrid neural network architecture for DNA sequence representation: A study on N4-methylcytosine sites

SOMM4mC: a Second-Order Markov Model for DNA N4-methylcytosine Site Prediction in Six Species

Computational Identification of N4-methylcytosine Sites in the Mouse Genome with Machine-Learning Method.

I4mc-El: Identifying DNA N4-Methylcytosine Sites in the Mouse Genome Using Ensemble Learning

Exploring sequence-based features for the improved prediction of DNA N4-methylcytosine sites in multiple species.

Hyb4mC: a Hybrid DNA2vec-based Model for DNA N4-methylcytosine Sites Prediction

Evaluation of Different Computational Methods on 5-Methylcytosine Sites Identification.

PSP-PJMI: an Innovative Feature Representation Algorithm for Identifying DNA N4-methylcytosine Sites

Identification of DNA Modification Sites Based on Elastic Net and Bidirectional Gated Recurrent Unit with Convolutional Neural Network

An Integrated Multi-Model Framework Utilizing Convolutional Neural Networks Coupled with Feature Extraction for Identification of 4mC Sites in DNA Sequences

Identification of DNA N4-methylcytosine Sites Via Multiview Kernel Sparse Representation Model

RNAm5CPred: Prediction of RNA 5-Methylcytosine Sites Based on Three Different Kinds of Nucleotide Composition

Idna-Ms: an Integrated Computational Tool for Detecting DNA Modification Sites in Multiple Genomes

Iterative Feature Representations Improve N4-methylcytosine Site Prediction.

Mus4mCPred: Accurate Identification of DNA N4-Methylcytosine Sites in Mouse Genome Using Multi-View Feature Learning and Deep Hybrid Network

A Deep Neural Network for Identifying DNA N4-Methylcytosine Sites