Computing the BWT and LCP array of a Set of Strings in External Memory

Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, Raffaella Rizzi
DOI: https://doi.org/10.1016/j.tcs.2020.11.041
2020-12-04
Abstract: Indexing very large collections of strings, such as those produced by the widespread next-generation sequencing technologies, heavily relies on multistring generalizations of the Burrows-Wheeler Transform (BWT); the large memory requirements of in-memory approaches have stimulated recent developments in external-memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental in computing the suffix-prefix overlaps among strings, which is an essential step for many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP array, in which we merge partial BWTs with a forward approach to sort suffixes. In this paper, we propose an alternative backward strategy to develop an external-memory method that simultaneously builds the BWT and the LCP array of a collection of m strings of different lengths. On a set of strings of constant length k, the algorithm has O(mkl) time and I/O volume and uses O(k + m) main memory, where l is the maximum value in the LCP array.
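To make the two objects named in the abstract concrete, the following is a minimal in-memory sketch that builds the BWT and LCP array of a small string collection by naively sorting all suffixes. It is not the external-memory backward method the paper proposes, only an illustration of what that method outputs. The function name `bwt_lcp` and the conventions used here are assumptions for illustration: each string gets its own end marker, markers are ordered by string index and sort before every letter, distinct markers never match each other, every marker is printed as `$`, and the first LCP entry is 0.

```python
def bwt_lcp(strings):
    """Naive BWT and LCP array of a string collection (illustrative only)."""
    suffixes = []
    for i, s in enumerate(strings):
        # Letters become (1, ch); the end marker of string i becomes (0, i),
        # so markers sort before letters and are ordered by string index.
        symbols = [(1, ch) for ch in s] + [(0, i)]
        for j in range(len(symbols)):
            # Symbol preceding the suffix; the whole string is preceded
            # (cyclically) by its own end marker, printed as "$".
            prev = s[j - 1] if j > 0 else "$"
            suffixes.append((symbols[j:], prev))

    # Lexicographic order of all suffixes of all strings.
    suffixes.sort(key=lambda item: item[0])

    # BWT: the preceding symbols, read in suffix order.
    bwt = "".join(prev for _, prev in suffixes)

    # LCP: length of the common prefix of each suffix with its predecessor.
    lcp = [0]
    for (a, _), (b, _) in zip(suffixes, suffixes[1:]):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp.append(n)
    return bwt, lcp
```

For the collection `["AB", "B"]`, the sorted suffixes are `$`, `$`, `AB$`, `B$`, `B$`, so this sketch returns the BWT `BB$A$` and the LCP array `[0, 0, 0, 0, 1]`. The sort holds every suffix in memory and compares them symbol by symbol, which is exactly the cost that external-memory approaches are designed to avoid.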
Data Structures and Algorithms