Abstract

Background: De novo transcriptome assembly is an important technique for understanding gene expression in non-model organisms. Many de novo assemblers build the de Bruijn graph of a set of RNA sequences and rely on an in-memory representation of this graph. However, current methods analyse the complete set of read-derived k-mer sequences at once, resulting in the need for computer hardware with large shared memory.

Results: We introduce a novel approach that clusters k-mers as the first step. The clusters correspond to small sets of gene products, which can be processed quickly to give candidate transcripts. We implement the clustering step using the MapReduce approach for parallelising the analysis of large datasets, which enables the use of compute clusters. The computational task is distributed across the compute system using the industry-standard MPI protocol, and no specialised hardware is required. Using this approach, we have re-implemented the Inchworm module from the widely used Trinity pipeline, and tested the method in the context of the full Trinity pipeline. Validation tests on a range of real datasets show large reductions in the runtime and per-node memory requirements when making use of a compute cluster.

Conclusions: Our study shows that MapReduce-based clustering has great potential for distributing challenging sequencing problems, without loss of accuracy. Although we have focussed on the Trinity package, we propose that such clustering is a useful initial step for other assembly pipelines.
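The idea of clustering k-mers with a map step and a reduce step can be illustrated with a small, single-process sketch. This is not the authors' implementation: it assumes that two k-mers belong to the same cluster when they share a (k-1)-mer, it ignores reverse complements and abundance filtering, and it runs in one process rather than over MPI. All function names (`map_phase`, `shuffle`, `reduce_phase`, `cluster_kmers`) are hypothetical.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every k-mer contained in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def map_phase(reads, k):
    """Map: key each k-mer by its (k-1)-mer prefix and its (k-1)-mer suffix."""
    for read in reads:
        for km in kmers(read, k):
            yield km[:-1], km
            yield km[1:], km


def shuffle(pairs):
    """Group values by key, as a MapReduce framework would between phases."""
    groups = defaultdict(set)
    for key, value in pairs:
        groups[key].add(value)
    return groups


def reduce_phase(groups):
    """Reduce: merge k-mers that share a (k-1)-mer, using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members in groups.values():
        ordered = sorted(members)
        root = find(ordered[0])
        for other in ordered[1:]:
            parent[find(other)] = root
            root = find(root)

    clusters = defaultdict(set)
    for km in list(parent):
        clusters[find(km)].add(km)
    return list(clusters.values())


def cluster_kmers(reads, k):
    """Assumed pipeline: map, shuffle, then reduce into k-mer clusters."""
    return reduce_phase(shuffle(map_phase(reads, k)))


if __name__ == "__main__":
    # Two toy reads with no shared (k-1)-mers, so two clusters are expected.
    for cluster in cluster_kmers(["ACGTAC", "TTTGGG"], k=4):
        print(sorted(cluster))
```

In a distributed setting the shuffle would be handled by the MapReduce framework and the merging would typically need several iterations across nodes. Each resulting cluster could then be assembled independently, which is what keeps the per-node memory requirement small.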

EST Clustering in Large Dataset with MapReduce

Distributed Affinity Propagation Clustering Based on MapReduce

Parallel Subspace Clustering Using MapReduce

Parallel spectral clustering algorithm

An Enhanced Agglomerative Fuzzy K-Means Clustering Method with Mapreduce Implementation on Hadoop Platform

An efficient PAM spatial clustering algorithm based on MapReduce

K-Means Clustering with Bagging and MapReduce

Parallel algorithms of large-scale EST clustering: current progress

The performance of MapReduce: an in-depth study

The Performance of MapReduce

MapReduce Based Method for Big Data Semantic Clustering

A 2-Tier Clustering Algorithm with Map-Reduce

Distributed structural clustering on large graph

Parallel Semi-Supervised Multi-Ant Colonies Clustering Ensemble Based on MapReduce Methodology

Fast Clustering using MapReduce

Bwasw-Cloud: Efficient sequence alignment algorithm for two big data with MapReduce

K-mer clustering algorithm using a MapReduce framework: application to the parallelization of the Inchworm module of Trinity

MR-ELM: a MapReduce-based framework for large-scale ELM training in big data era

Large-scale Data Mining Method based on Clustering Algorithm Combined with MAPREDUCE

K-Means Parallel Algorithm of Big Data Clustering Based on Mapreduce PCAM Method

An Efficient K-Means Clustering Algorithm On Mapreduce