Unified Programming Model and Software Framework for Big Data Machine Learning and Data Analytics.
Rong Gu, Yun Tang, Qianhao Dong, Zhaokang Wang, Zhiqiang Liu, Shuai Wang, Chunfeng Yuan, Yihua Huang
DOI: https://doi.org/10.1109/compsac.2015.275
2015-01-01
Abstract: In the era of Big Data, the rapid growth of applications such as social media and web search requires efficient and scalable machine learning and statistical analysis algorithms. However, there is a lack of easy-to-use and efficient software frameworks or systems that support fast development of such big data analytical algorithms. To address this problem, we propose Octopus, an easy-to-use and efficient analytical system for big data. Octopus allows data analysts to conduct complex analytics on big data with traditional programming languages and methods in an easy and efficient way. To achieve ease of use, we propose a unified programming model based on matrices, since matrix computation is at the core of many data-intensive statistical applications such as numerical analysis and data mining. Further, to improve performance, the Octopus software framework runs on several distributed computing platforms, including Hadoop MapReduce, Spark, and MPI. On these computing platforms, we design several parallel matrix computation algorithms, each suited to different scenarios. Finally, the features of Octopus are encapsulated into a library with matrix-based APIs and exposed to users as an R package. R is a widely used statistical programming language that supports diverse data analysis tasks through extension packages. Experimental results show that Octopus performs efficiently and scales nearly linearly.
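
As an illustration of what a parallel matrix computation can look like, the following is a minimal sketch of block-partitioned matrix multiplication in Python. It is not the Octopus implementation or its API: the abstract does not specify the algorithms, and Octopus itself is exposed as an R package. The names `multiply_block` and `blocked_matmul`, the `block_size` and `workers` parameters, and the use of a local thread pool in place of a distributed platform are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_block(a, b, rows, cols):
    """Compute the sub-block of a @ b for the given row and column ranges."""
    inner = len(b)
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in cols]
        for i in rows
    ]


def blocked_matmul(a, b, block_size=2, workers=4):
    """Multiply a (n x m) by b (m x p), computing one output block per task.

    Hypothetical helper: a thread pool stands in for the workers of a
    distributed platform, purely to show the partitioning idea.
    """
    n, p = len(a), len(b[0])
    if any(len(row) != len(b) for row in a):
        raise ValueError("inner dimensions do not match")

    result = [[0] * p for _ in range(n)]

    # Partition the output matrix into independent (row range, column range) blocks.
    tasks = [
        (range(r, min(r + block_size, n)), range(c, min(c + block_size, p)))
        for r in range(0, n, block_size)
        for c in range(0, p, block_size)
    ]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        blocks = pool.map(lambda t: multiply_block(a, b, *t), tasks)
        # Copy each computed block back into its position in the result.
        for (rows, cols), block in zip(tasks, blocks):
            for bi, i in enumerate(rows):
                for bj, j in enumerate(cols):
                    result[i][j] = block[bi][bj]
    return result
```

Each output block depends only on a band of rows from the left matrix and a band of columns from the right matrix, so the blocks can be computed independently. That independence is the property a distributed framework could exploit when spreading the work across nodes.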