Generic Non-Recursive Suffix Array Construction

Jannik Olbrich, Enno Ohlebusch, Thomas Büchler
DOI: https://doi.org/10.1145/3641854
IF: 1.113
2024-02-08
ACM Transactions on Algorithms
Abstract: The suffix array is arguably one of the most important data structures in sequence analysis, and consequently there is a multitude of suffix sorting algorithms. However, to date the GSACA algorithm, introduced in 2015, is the only known non-recursive linear-time suffix array construction algorithm (SACA). Despite its interesting theoretical properties, there has been little effort to improve GSACA’s non-competitive real-world performance. There is a super-linear algorithm, DSH, which relies on the same sorting principle and is faster than DivSufSort, the fastest SACA for over a decade. The purpose of this paper is twofold: we analyse the sorting principle used in GSACA and DSH and exploit its properties in order to give an optimised linear-time algorithm, and we show that it can be very elegantly used to compute both the original extended Burrows-Wheeler transform (\(\mathsf{eBWT}\)) and a bijective version of the Burrows-Wheeler transform (\(\mathsf{BBWT}\)) in linear time. We call the algorithm “generic” since it can be used to compute the regular suffix array and the variants used for the \(\mathsf{BBWT}\) and \(\mathsf{eBWT}\). Our suffix array construction algorithm is not only significantly faster than GSACA but also outperforms DivSufSort and DSH. Our \(\mathsf{BBWT}\) algorithm is faster than or competitive with all other tested \(\mathsf{BBWT}\) construction implementations on large or repetitive data, and our \(\mathsf{eBWT}\) algorithm is faster than all other programs on data that is not extremely repetitive.
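For readers unfamiliar with the objects named in the abstract, the following minimal Python sketch shows what a suffix array is and how the classical Burrows-Wheeler transform can be read off from it. It is not the paper's algorithm: it sorts suffixes naively (super-linear time), and the function names `naive_suffix_array` and `bwt_from_suffix_array` are hypothetical, chosen only for illustration. It assumes the input ends with a unique sentinel character that is smaller than every other character.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Naive illustration only: comparing whole suffixes makes this super-linear,
    unlike the linear-time construction the abstract describes.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text: str, sa: list[int]) -> str:
    """Read the classical BWT off a suffix array.

    Assumes `text` ends with a unique sentinel smaller than all other characters.
    For each sorted suffix, emit the character that precedes it in `text`
    (index -1 wraps around to the sentinel for the suffix starting at 0).
    """
    return "".join(text[i - 1] for i in sa)


if __name__ == "__main__":
    text = "banana$"
    sa = naive_suffix_array(text)
    print(sa)                               # [6, 5, 3, 1, 0, 4, 2]
    print(bwt_from_suffix_array(text, sa))  # annb$aa
```

The \(\mathsf{BBWT}\) and \(\mathsf{eBWT}\) variants mentioned in the abstract are built on a different ordering of conjugates rather than plain suffixes, so this sketch covers only the regular suffix array case.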
computer science, theory & methods; mathematics, applied