Abstract:

Background: Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. SSRs have been shown to be associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, sequenced genomes often miss several highly repetitive regions, and many non-model species have no complete genome at all. With recent advances in next-generation sequencing (NGS) techniques, large-scale sequence reads can be rapidly generated for any species. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. Because the most commonly used NGS platforms (e.g., the Illumina platform) generally provide short paired-end reads, merging overlapping paired-end reads has become a common step prior to the identification of SSR loci. Merging short read pairs and identifying SSRs from large-scale data poses a big data analysis challenge for traditional stand-alone tools.

Results: In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, which are built on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH merges short read pairs and BigPERF mines SSRs, each in a distributed big data manner. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution time of read pair merging and SSR mining on very large-scale DNA sequence data.

Conclusions: The excellent performance of BigFiRSt stems mainly from its use of Big Data Hadoop technology to merge read pairs and mine SSRs in parallel on distributed computing clusters. We anticipate that BigFiRSt will be a valuable tool in the coming biological Big Data era.
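
To make the two steps named in the abstract concrete, the following is a minimal single-machine sketch in Python of merging one overlapping read pair and then scanning the merged sequence for SSRs. It is an illustration only and is not the FLASH, PERF, or BigFiRSt algorithm: the function names (`merge_pair`, `find_ssrs`), the exact-match overlap rule, and the thresholds (`min_overlap`, `max_motif`, `min_repeats`) are all assumptions chosen for brevity.

```python
import re

# Translation table for complementing bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=10):
    """Merge a read pair on the longest exact overlap (hypothetical rule).

    The reverse read is reverse-complemented, then the longest suffix of
    `forward` that equals a prefix of the mate is used as the overlap.
    Returns the merged sequence, or None if no overlap of at least
    `min_overlap` bases exists. Real mergers also tolerate mismatches
    and use base qualities; this sketch does not.
    """
    mate = reverse_complement(reverse)
    for size in range(min(len(forward), len(mate)), min_overlap - 1, -1):
        if forward[-size:] == mate[:size]:
            return forward + mate[size:]
    return None


def find_ssrs(seq, max_motif=6, min_repeats=3):
    """Return (start, motif, repeat_count) for each tandem repeat found.

    A lazy motif group followed by a backreference makes the regex try
    the shortest motif first at each position.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1)
    )
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq)
    ]


if __name__ == "__main__":
    # Toy fragment with an (AC)5 repeat; both reads are 20 bases long
    # and overlap by 12 bases.
    fragment = "TTGCAGGATG" + "ACACACACAC" + "GGTCAATG"
    read1 = fragment[:20]
    read2 = reverse_complement(fragment[8:])
    merged = merge_pair(read1, read2)
    print(merged)
    print(find_ssrs(merged))
```

In a Hadoop-style setting, functions like these would run independently on each read pair inside map tasks, which is why the workload parallelizes well. How BigFiRSt actually partitions and schedules the data is not described in this abstract.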

KSI: a DNA sequence matching library for terabyte scale bio-data

A Fast Exact Pattern Matching Algorithm for Biological Sequences

Gene Sequence Alignment on a Public Computing Platform

DNA-SaM, a robust system for large-scale data storage

An efficient parallel algorithm for multiple sequence similarities calculation using a low complexity method.

Analyzing large-scale DNA Sequences on Multi-core Architectures

Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics

DNAscan: a fast, computationally and memory efficient bioinformatics pipeline for the analysis of DNA next-generation-sequencing data

High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism

Gene sequence analysis model construction based on k-mer statistics

BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches

An Implementation of Parallel Accelerating System on Chip for DNA Sequence Matching

ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis

Molecular-level similarity search brings computing to DNA data storage

BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data

CRAFT: Compact genome Representation toward large-scale Alignment-Free daTabase

Parallel linear space algorithm for large-scale sequence alignment

KDE Bioscience: Platform for Bioinformatics Analysis Workflows.

diverse-seq: an application for alignment-free selecting and clustering biological sequences

GenSeq+: A Scalable High-Performance Accelerator for Genome Sequencing.

Massively Parallel Implementation of Sequence Alignment with Basic Local Alignment Search Tool Using Parallel Computing in Java Library