Fast-Part: Fast and Accurate Data Partitioning for Biological Sequence Analysis

Shafayat Ahmed, Muhit Islam Emon, Nazifa Ahmed Moumi, Liqing Zhang
DOI: https://doi.org/10.1101/2024.11.13.623463
2024-11-15
Abstract: Developing effective machine learning models for the classification of biological sequences depends heavily on the quality of the split between the training and test datasets. Existing tools are either computationally expensive, unable to maintain the desired level of similarity between the training and test datasets, or unable to preserve the stratification of the training:test ratio. Here, we present Fast-Part, a fast and accurate sequence data partitioning tool that ensures strict homology separation between the training and test datasets and the best possible training:test stratification ratio, while remaining computationally fast. Evaluation of Fast-Part on multiple protein sequence datasets shows that, compared with existing tools, it partitions data with exceptional speed while maintaining strict partitioning. Like CD-HIT[1] and MMseq[2], Fast-Part can handle massive datasets, and like GraphPart[3], it maintains strict homology partitioning.
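The abstract does not describe Fast-Part's algorithm, so the following is not the authors' method. It is only a minimal sketch of the general idea the abstract names: keep whole homology clusters on one side of the split while steering each class toward a target test fraction. It assumes cluster assignments have already been computed by some clustering tool; the function name `cluster_split`, the `(seq_id, label, cluster_id)` record layout, and the greedy largest-first rule are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def cluster_split(records, test_fraction=0.2):
    """Split sequences into train/test without ever dividing a cluster.

    records: iterable of (seq_id, label, cluster_id) tuples (assumed layout).
    Returns (train_ids, test_ids).
    """
    records = list(records)

    # Group sequences by their precomputed homology cluster.
    clusters = defaultdict(list)
    for seq_id, label, cluster_id in records:
        clusters[cluster_id].append((seq_id, label))

    # Per-class budget: how many sequences of each label may go to test.
    label_totals = Counter(label for _, label, _ in records)
    budget = {lab: test_fraction * n for lab, n in label_totals.items()}

    test_counts = Counter()
    train_ids, test_ids = [], []

    # Greedy pass, largest clusters first (ties broken by cluster id).
    ordered = sorted(clusters.items(), key=lambda kv: (-len(kv[1]), str(kv[0])))
    for _, members in ordered:
        member_labels = Counter(label for _, label in members)
        # A cluster goes to test only if it fits every affected class budget.
        fits = all(
            test_counts[lab] + n <= budget[lab]
            for lab, n in member_labels.items()
        )
        if fits:
            test_counts.update(member_labels)
            test_ids.extend(seq_id for seq_id, _ in members)
        else:
            train_ids.extend(seq_id for seq_id, _ in members)

    return train_ids, test_ids
```

Because every cluster is assigned as a unit, no cluster contributes sequences to both sets. The per-class test fraction can fall short of the target when clusters are large, which is the tension between strict separation and ratio stratification that the abstract refers to. A sketch like this also cannot guarantee separation between sequences in different clusters; that depends on the clustering itself.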
Bioinformatics