Forest Cover Types Classification Based on Online Machine Learning on Distributed Cloud Computing Platforms of Storm and SAMOA

Guang Di Li, Guo Yin Wang, Xue Rui Zhang, Wei Hui Deng, Fan Zhang
DOI: https://doi.org/10.4028/www.scientific.net/amr.955-959.3803
2014-01-01
Advanced Materials Research
Abstract: Storm is the most popular real-time stream processing platform and can be used for online machine learning. Just as Hadoop provides a set of general primitives for batch processing, Storm provides a set of general primitives for real-time computation. SAMOA includes distributed algorithms for the most common machine learning tasks, much as Mahout does for Hadoop. SAMOA is both a platform and a library. In this paper, the Forest Cover Types dataset, a large benchmark dataset available at the UCI KDD Archive, is used as the data stream source. The Vertical Hoeffding Tree, a parallel streaming decision tree induction algorithm for distributed environments that is incorporated in the SAMOA API, is applied on the Storm platform. This study compares the stream processing technique for predicting forest cover types from cartographic variables with traditional machine learning algorithms applied to the same dataset. The test-then-train method used in this system differs fundamentally from the traditional train-then-test approach. The results indicate that the output of the stream processing technique is asymptotically nearly identical to that of a conventional learner, while the model derived from this system is scalable, real-time, capable of handling evolving streams, and insensitive to stream ordering.
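The test-then-train idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's system and does not use Storm, SAMOA, or the Vertical Hoeffding Tree: the learner below is a toy majority-class predictor, and the names `MajorityClassLearner`, `prequential_accuracy`, and the stream contents are hypothetical, chosen only to show the order of operations. Each instance is first used to test the current model and only then used to update it, so no separate held-out test set is needed.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner (an assumption for illustration only):
    predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Before any label has been seen there is nothing to predict.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        # Incremental update: one instance at a time, no stored history.
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test-then-train evaluation: predict on each instance first,
    score the prediction, then let the learner train on that instance."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:  # test first ...
            correct += 1
        learner.learn(x, y)          # ... then train
        total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical toy stream of (features, label) pairs; labels are made up.
    toy_stream = [((0,), "spruce"), ((1,), "spruce"), ((2,), "pine"), ((3,), "spruce")]
    print(prequential_accuracy(toy_stream, MajorityClassLearner()))
```

In a train-then-test setup the model would instead be fitted on one fixed partition and scored on another. Here every instance contributes to both the accuracy estimate and the model, which is what makes the scheme suitable for unbounded streams.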