Abstract: Many scientific investigations require data-intensive research in which big data are collected and analyzed. To get big insights from big data, we need to first develop our initial hypotheses from the data and then test and validate those hypotheses against the data. Visualization is often considered a good means to suggest hypotheses from a given dataset. Computational algorithms, coupled with scalable computing, can perform hypothesis testing with big data. Furthermore, interactive visual interfaces can allow domain experts to directly interact with data and participate in the loop to refine their research questions and redirect their research directions. In this paper we discuss a framework that integrates information visualization, scalable computing, and user interfaces to explore large-scale multi-modal data streams. Discovering new knowledge from the data requires the means to exploratively analyze datasets of this scale, allowing us to freely “wander” around the data and make discoveries by combining bottom-up pattern discovery and top-down human knowledge to leverage the power of the human perceptual system. We start with a novel interactive temporal data mining method that allows us to discover reliable sequential patterns and precise timing information in multivariate time series. We then proceed to a parallelized solution that extracts reliable patterns from large-scale time series using iterative MapReduce tasks. Our work exploits visual-based information technologies to allow scientists to interactively explore, visualize and make sense of their data. For example, the parallel mining algorithm running on HPC is accessible to users through an asynchronous web service. In this way, scientists can compare the intermediate data to extract and propose new rounds of analysis for more scientifically meaningful and statistically reliable patterns, and therefore statistical computing and visualization can bootstrap each other.
Furthermore, visual interfaces in the framework allow scientists to directly participate in the loop and redirect the analysis. All of these combine into an effective and efficient way to perform closed-loop big data analysis with visualization and scalable computing.
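
To make the idea of extracting patterns with iterative MapReduce tasks concrete, the following is a minimal single-process sketch, not the implementation discussed in this paper. It assumes each time series has already been discretized into a sequence of event labels, treats a pattern as a contiguous run of events, ignores timing information, and uses an Apriori-style pruning rule. The names `map_candidates`, `reduce_counts`, `mine_patterns`, `min_support` and `max_len` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from functools import reduce

def map_candidates(seq, k, frequent_prev):
    """Map step: emit each distinct length-k contiguous pattern found in one sequence.

    For k > 1, a candidate is kept only if its length-(k-1) prefix was frequent
    in the previous round (Apriori-style pruning, an assumption of this sketch).
    """
    seen = set()
    for i in range(len(seq) - k + 1):
        cand = tuple(seq[i:i + k])
        if k == 1 or cand[:-1] in frequent_prev:
            seen.add(cand)
    # Each sequence contributes at most 1 to the support of a pattern.
    return Counter(seen)

def reduce_counts(total, partial):
    """Reduce step: sum the per-sequence counts for every pattern."""
    total.update(partial)  # Counter.update adds counts rather than replacing them
    return total

def mine_patterns(sequences, min_support, max_len=5):
    """Run one map/reduce round per pattern length until no pattern is frequent.

    `sequences` must be a list (it is traversed once per round).
    """
    frequent = {}
    prev = set()
    for k in range(1, max_len + 1):
        partials = (map_candidates(s, k, prev) for s in sequences)
        totals = reduce(reduce_counts, partials, Counter())
        level = {p: c for p, c in totals.items() if c >= min_support}
        if not level:
            break
        frequent.update(level)
        prev = set(level)  # feeds the pruning of the next round
    return frequent

if __name__ == "__main__":
    # Hypothetical event streams, e.g. coded behaviors from three sessions.
    streams = [
        ("look", "reach", "grasp"),
        ("look", "reach", "speak"),
        ("speak", "look", "reach"),
    ]
    for pattern, support in mine_patterns(streams, min_support=3).items():
        print(pattern, support)
```

In a real deployment each round would run as a distributed job, and the frequent patterns from one round would be shipped to the mappers of the next; the intermediate results of each round are the kind of data a scientist could inspect visually before deciding how to continue the analysis.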

Closed-loop Big Data Analysis with Visualization and Scalable Computing
