BATCH-SCAMPP: Batch Scaled Phylogenetic Placement Large Trees

Eleanor Wedell,Chengze Shen,Tandy Warnow
DOI: https://doi.org/10.1101/2022.10.26.513936
2024-11-19
Abstract:Phylogenetic placement, the problem of placing sequences into phylogenetic trees, has been limited either by the number of sequences placed in a single run or by the size of the placement tree. The most accurate scalable phylogenetic placement method with respect to the number of query sequences placed, EPA-ng, has a runtime that scales sublinearly to the number of query sequences. However, larger phylogenetic trees cause an increase in EPA-ng memory usage, limiting the method to placement trees of up to 10,000 sequences. Our recently designed SCAMPP framework has been shown to scale EPA-ng to larger placement trees of up to 200,000 sequences by building a subtree for the placement of each query sequence. The approach of SCAMPP does not take advantage of EPA-ng parallel efficiency since it only places a single query for each run of EPA-ng. Here we present BATCH-SCAMPP, a new technique that overcomes this barrier and enables EPA-ng and other phylogenetic placement methods to scale to ultra-large backbone trees and many query sequences. BATCH-SCAMPP is freely available at https://github.com/ewedell/BSCAMPP_code.
Biology
What problem does this paper attempt to address?