Building Big Data Processing and Visualization Pipeline through Apache Zeppelin

Yanzhe Cheng, Fang Cherry Liu, Shan Jing, Weijia Xu, Duen Horng Chau
DOI: https://doi.org/10.1145/3219104.3229288
2018-07-22
Abstract: Big data analytics pipelines have become popular for processing large volumes of data. Apache Zeppelin provides an integrated environment for data ingestion, data discovery, data analytics, data visualization, and collaboration, with an extensible framework that allows different programming languages and data processing back ends to be plugged in. The supported languages include Scala, Python, SQL, and shell script, and the supported big data processing back ends include Hadoop, Spark, and Hive. With these tool sets, interactive and dynamic data analysis can be done on the fly through heterogeneous programming interfaces. Although Zeppelin is well suited to code development and interactive analysis on small-scale data sets for proof-of-concept or use-case presentations, in some cases the data processing pipeline still needs to run in batch mode for performance and robustness, and to fit into an automated workflow. We are developing a tool that converts a Zeppelin notebook into a workflow made up of code that can run in batch mode through a command-line interface without a running Zeppelin instance, so that prototype code can be deployed seamlessly on a production cluster after the demo stage. The entire workflow can be preserved, configured manually, and run automatically. Zeppelin also provides a flexible way to integrate visualization functionality; the second contribution of this paper is to extend Zeppelin's existing built-in visualization component for D3Network. With these two added features, Zeppelin can help users develop big data pipelines and visualize graph data quickly and efficiently.
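The abstract does not describe how the conversion tool works internally, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the commonly used Zeppelin note layout, in which a note is a JSON document with a "paragraphs" list and each paragraph's "text" begins with an interpreter directive such as %pyspark or %sql. The function names, the default interpreter, and the set of skipped interpreters are all hypothetical choices made for illustration.

```python
import json
import re

# Assumption: paragraphs with no directive belong to the default interpreter.
DEFAULT_INTERPRETER = "spark"

# Assumption: markdown paragraphs are documentation and not part of the batch run.
SKIPPED_INTERPRETERS = {"md"}

# Matches a leading directive such as "%pyspark" or "%spark.sql".
DIRECTIVE = re.compile(r"^\s*%([\w.]+)[ \t]*\n?")


def split_paragraph(text, default=DEFAULT_INTERPRETER):
    """Return (interpreter, code) for the text of one notebook paragraph."""
    match = DIRECTIVE.match(text)
    if match is None:
        return default, text
    return match.group(1), text[match.end():]


def note_to_steps(note):
    """Turn a parsed note (a dict) into an ordered list of (interpreter, code)."""
    steps = []
    for paragraph in note.get("paragraphs", []):
        text = paragraph.get("text") or ""
        interpreter, code = split_paragraph(text)
        if interpreter in SKIPPED_INTERPRETERS or not code.strip():
            continue
        steps.append((interpreter, code))
    return steps


def load_note(path):
    """Read a note file from disk; the path is supplied by the caller."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # A tiny in-memory note standing in for a real exported notebook.
    example = {
        "paragraphs": [
            {"text": "%md\n# Demo pipeline"},
            {"text": "%pyspark\ndf = spark.read.json('events.json')"},
            {"text": "%sql\nselect count(*) from events"},
        ]
    }
    for interpreter, code in note_to_steps(example):
        print(interpreter, "->", code)
```

Each resulting step could then be written to its own script and handed to the matching command-line tool, which is one way to keep the paragraph order of the notebook while removing the dependency on a running Zeppelin server.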