FTGD: a machine learning method for flowering-time gene prediction

Junyu Zhang, Siming He, Wenquan Wang, Fei Chen, Zhidong Liu
DOI: https://doi.org/10.48130/tp-2023-0023
2023-01-01
Abstract: The timing of flowering significantly affects plant reproduction and crop yield, making it important to detect flowering-time associated genes. In this study, we retrieved 628 flowering-time associated protein sequences from FLOR-ID, a database of flowering-time genes in Arabidopsis thaliana, and created seven machine learning models using Support Vector Machine (SVM) algorithms to discriminate flowering-time associated genes (FTAGs) from non-FTAGs. The SVM-Kmer-PC-PseAAC model performed best (F1 score = 0.934, accuracy = 0.939, and receiver operating characteristic = 0.943). Using this model, we developed a plant FTAG prediction tool called "FTAGs_Find". With FTAGs_Find, we identified a total of 318,521 FTAGs in the protein datasets of 81 species. Notably, in O. lucimarinus, a non-flowering plant, only 208 FTAGs were predicted in the whole genome, accounting for just 2.68% of all genes, which is consistent with extensive FTAG loss during evolution. To facilitate user access to the FTAG prediction tool and the FTAG dataset, we constructed a plant flowering-time-associated genes database (FTAGdb), which will be a valuable resource for researchers and breeders.
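The "Kmer" part of the best model's name suggests that protein sequences are encoded as k-mer frequency vectors before being passed to the SVM. The following is a minimal sketch of that kind of encoding, not the authors' implementation: the function name `kmer_features`, the choice of k = 2, the 20-letter amino acid alphabet, and the normalization are all assumptions made for illustration, and the PC-PseAAC features and the SVM itself are not shown.

```python
from collections import Counter
from itertools import product

# Assumption: the standard 20-letter amino acid alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_features(sequence, k=2):
    """Return normalized k-mer frequencies as a fixed-length list.

    Hypothetical helper for illustration. The vector has 20**k entries,
    ordered as itertools.product orders the alphabet. Windows containing
    letters outside the alphabet (for example 'X') are ignored.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    # Count every overlapping window of length k in the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Normalize only over windows made of standard amino acids.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


# Toy input, not a real protein sequence.
vec = kmer_features("ACAC", k=2)
```

A vector like `vec` has the same length for every protein regardless of sequence length, which is what allows a fixed-input classifier such as an SVM to be trained on FTAG and non-FTAG examples.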