PJama Stores and Suffix Tree Indexing for Bioinformatics Applications

Ela Pustulka-Hunt
Abstract:
Motivation: The biggest public domain biological sequence archive exceeds 6 Gbases of DNA, and much larger sequence amounts are held by industrial labs. The amount of data is growing exponentially, but sequence search technologies still rely on flat file storage and high-throughput parallel computers reading all data sequentially to find sequence similarities or patterns. This issue is not addressed by existing database technologies.
Results: We explored DNA and protein sequence indexing using transient and persistent suffix trees and tested our retrieval methods with human, worm and bacterial DNA, and protein data sets. Our index structure is designed in Java and takes advantage of orthogonal persistence for Java, PJama. Our exact sequence search methods deliver excellent performance and will complement our existing genome map applets by showing sequence query hits in genomic context.