Integrating Apache Spark and External Data Sources Using Hadoop Interfaces

Yong-Liang LI, Shu-Qiang YANG
DOI: https://doi.org/10.12783/dtetr/ssme-ist2016/3990
2016-01-01
DEStech Transactions on Engineering and Technology Research
Abstract: Data-processing requirements are undergoing a profound transition as the volume and variety of application data increase dramatically. In response, a range of highly scalable storage systems has been developed for large-scale data sources, and Apache Spark, built for computation over very large data sets, has attracted considerable industry attention since its release because of its excellent performance. Combining Spark with external data sources has great potential to yield valuable data analysis platforms that address a wide range of data-handling needs. Moreover, because the industry has already invested heavily in data stores for Hadoop, and Spark works well with Hadoop-supported systems, shifting from Hadoop to Spark application development is easy. This paper examines the mechanisms for integrating Spark with external data sources, such as NoSQL and relational databases (RDBs), and proposes a reference approach for tight and efficient integration of the two using Hadoop interfaces.