Abstract: The Naive Bayesian algorithm is one of the ten classical algorithms in data mining and is widely used as the basic theory for text classification. With the rapid development of the Internet and information systems, huge amounts of data are being produced all the time. The traditional Bayesian classification algorithm runs into problems when it must process massive amounts of data, especially without a parallel computing framework. This paper proposes INBCS, an improved Bayesian algorithm for text classification in the Spark computing environment, which improves the Naive Bayesian algorithm based on a polynomial model. For data preprocessing, this paper first proposes a parallel noise elimination algorithm and then a parallel dimension reduction algorithm based on Information Gain and TextRank computation in the Spark environment. On the preprocessed data, an improved parallel method is proposed for calculating the conditional probability, one that comprehensively considers the effects of the feature items in each document, class, and training set. Finally, experiments on several widely used corpora on the Spark computation platform show that INBCS obtains higher accuracy and efficiency than some current improvements and implementations of the Naive Bayesian algorithm in Spark ML-library.
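
For orientation, the conditional probability that the abstract says INBCS improves can be illustrated with the standard baseline. The following is a minimal single-machine sketch of ordinary Naive Bayes under the word-count model (usually called the multinomial model, which is presumably what "polynomial model" refers to here), with Laplace smoothing. It is not the INBCS method: it has no noise elimination, no Information Gain or TextRank feature selection, and no Spark parallelism. The function names, the `alpha` parameter, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log P(word | class) from tokenized docs.

    Hypothetical helper for illustration; alpha is the Laplace smoothing constant.
    """
    vocab = sorted({word for tokens in docs for word in tokens})
    docs_per_class = Counter(labels)

    # Count how often each word occurs in the documents of each class.
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)

    log_prior = {c: math.log(n / len(docs)) for c, n in docs_per_class.items()}

    log_cond = {}
    for c in docs_per_class:
        # Denominator: all word occurrences in class c plus smoothing mass.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_cond[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_cond


def classify(tokens, log_prior, log_cond):
    """Return the class with the highest posterior score; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_cond[c][w] for w in tokens if w in log_cond[c]
        )
    return max(scores, key=scores.get)


# Toy corpus (made up for this sketch).
docs = [
    ["spark", "cluster", "data"],
    ["spark", "parallel", "data"],
    ["goal", "match", "team"],
    ["team", "score", "goal"],
]
labels = ["tech", "tech", "sport", "sport"]

log_prior, log_cond = train_multinomial_nb(docs, labels)
predicted = classify(["spark", "data"], log_prior, log_cond)
```

In this baseline, each word's conditional probability depends only on its total count within a class. According to the abstract, INBCS instead weighs the effect of a feature item at the document, class, and training-set levels.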

A Parallel Algorithm for Bayesian Text Classification Based on Noise Elimination and Dimension Reduction in Spark Computing Environment

Parallel naive Bayes algorithm for large-scale Chinese text classification based on spark

A Technique For Improving The Performance Of Naive Bayes Text Classification

Naive Bayes Based Criminal Text Classification of Unbalanced Classes

DISTRIBUTED PARALLEL IMPLEMENTATION OF A NAIVE BAYESIAN TEXT CLASSIFICATION ALGORITHM

A Novel Text Classification Algorithm Based on Naïve Bayes and KL-Divergence

Parallelization of Classification Algorithms Based on SparkR

Discriminatively Weighted Naive Bayes and Its Application in Text Classification

The Optimization Of Parallel Dbn Based On Spark

Bayesian Naïve Bayes Classifiers to Text Classification

Naive Bayesian Text Classification Algorithm in Cloud Computing Environment

Acceleration of Naive-Bayes algorithm on multicore processor for massive text classification

A Performance Evaluation of Classification Algorithms for Big Data

Parallel Noise Eliminate: A Parallel Noise Elimination Algorithm for Massive Text Categorization

Bayesian Multinomial Naïve Bayes Classifier to Text Classification

An Optimal Bayes Classification Algorithm

Parallel Implementation Of Classification Algorithms Based On Mapreduce

An Improved Algorithm of Bayesian Text Categorization

A New Naive Bayes Text Classification Algorithm

Context Semantic-based Naive Bayesian Algorithm for Text Classification

A Parallelized Semi-Supervised Naïve Bayes Classifier