Abstract: Many scientific investigations require data-intensive research in which big data are collected and analyzed. To get big insights from big data, we need to first develop our initial hypotheses from the data and then test and validate those hypotheses. Visualization is often considered a good means to suggest hypotheses from a given dataset. Computational algorithms, coupled with scalable computing, can perform hypothesis testing with big data. Furthermore, interactive visual interfaces can allow domain experts to directly interact with data and participate in the loop to refine their research questions and redirect their research directions. In this paper we discuss a framework that integrates information visualization, scalable computing, and user interfaces to explore large-scale multi-modal data streams. Discovering new knowledge from the data requires the means to analyze datasets of this scale in an exploratory way, allowing us to freely "wander" around the data and make discoveries by combining bottom-up pattern discovery with top-down human knowledge to leverage the power of the human perceptual system. We start with a novel interactive temporal data mining method that allows us to discover reliable sequential patterns and precise timing information in multivariate time series. We then proceed to a parallelized solution that extracts reliable patterns from large-scale time series using iterative MapReduce tasks. Our work exploits visual-based information technologies to allow scientists to interactively explore, visualize, and make sense of their data. For example, the parallel mining algorithm running on HPC is accessible to users through an asynchronous web service. In this way, scientists can compare the intermediate data to extract and propose new rounds of analysis for more scientifically meaningful and statistically reliable patterns, and therefore statistical computing and visualization can bootstrap each other. Furthermore, visual interfaces in the framework allow scientists to directly participate in the loop and redirect the analysis direction. All these combine to reveal an effective and efficient way to perform closed-loop big data analysis with visualization and scalable computing.
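
The idea of extracting sequential patterns through iterative MapReduce tasks can be illustrated with a minimal sketch. This is not the algorithm described in the paper: it is a plain in-process Python simulation, assuming each time series has already been discretized into a list of event labels, and it ignores timing information entirely. The function names (`map_phase`, `reduce_phase`, `mine_patterns`), the `min_support` threshold, and the event labels are all hypothetical. Each round plays the role of one MapReduce job: the map step emits candidate contiguous patterns of a given length, and the reduce step counts how many sequences contain each one.

```python
from collections import defaultdict
from itertools import chain


def map_phase(sequence, length, frequent_prefixes):
    """Emit (pattern, 1) once per distinct contiguous pattern in one sequence."""
    seen = set()
    for i in range(len(sequence) - length + 1):
        pattern = tuple(sequence[i:i + length])
        # Prune: only extend patterns whose prefix survived the previous round.
        if length == 1 or pattern[:-1] in frequent_prefixes:
            seen.add(pattern)
    return [(pattern, 1) for pattern in seen]


def reduce_phase(pairs, min_support):
    """Sum counts per pattern and keep those that meet the support threshold."""
    counts = defaultdict(int)
    for pattern, one in pairs:
        counts[pattern] += one
    return {p: c for p, c in counts.items() if c >= min_support}


def mine_patterns(sequences, min_support, max_length=3):
    """Run one map/reduce round per pattern length until nothing survives."""
    results, frequent = {}, set()
    for length in range(1, max_length + 1):
        pairs = chain.from_iterable(
            map_phase(seq, length, frequent) for seq in sequences
        )
        survivors = reduce_phase(pairs, min_support)
        if not survivors:
            break
        results.update(survivors)
        frequent = set(survivors)
    return results


# Hypothetical event sequences, one per recorded session.
sequences = [
    ["look", "reach", "grasp", "name"],
    ["look", "reach", "name"],
    ["reach", "grasp", "look", "reach"],
]
patterns = mine_patterns(sequences, min_support=2)
```

In a real deployment the map calls would run in parallel across partitions of the data and the intermediate `patterns` of each round are the kind of result a user could inspect before requesting the next round; here everything runs sequentially for clarity.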