A PARALLEL AND MODULAR PATTERN CLASSIFICATION FRAMEWORK FOR LARGE-SCALE PROBLEMS

Bao-liang Lu, Xiao-lin Wang
DOI: https://doi.org/10.1142/9789814273398_0032
2009-01-01
Abstract: The number of samples available on the internet for training pattern classifiers is increasing rapidly, while traditional pattern classification techniques that run on a single computer system cannot process such large-scale data sets. This chapter presents a parallel and modular pattern classification framework for coping with large-scale pattern classification problems. The proposed framework follows a divide-and-conquer strategy that easily assigns a given large-scale problem to an available parallel and distributed computing infrastructure. The framework consists of three independent parts: decomposing training data sets, training component classifiers in parallel, and combining the trained component classifiers. To evaluate the performance of the proposed framework, we perform experiments on a large-scale Japanese patent classification problem containing about 3,500,000 patent documents. The experimental results show that our framework has the following attractive features: (a) The framework is general, and therefore any traditional pattern classification technique, such as support vector machines, can be easily embedded in the framework as a component classifier. (b) The framework can incorporate explicit domain or prior knowledge into learning through the process of dividing training data sets. (c) The framework has good scalability and is easily implementable in hardware.
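The three parts named in the abstract (decompose, train in parallel, combine) can be illustrated with a minimal sketch. This is not the chapter's implementation. The round-robin split, the nearest-centroid component classifier, the majority-vote combination rule, the thread pool, and every function name below are assumptions chosen only to keep the example self-contained.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def decompose(samples, n_parts):
    """Split (features, label) pairs into n_parts subsets, round-robin (assumed rule)."""
    return [samples[i::n_parts] for i in range(n_parts)]


def train_centroid_classifier(subset):
    """Train a stand-in component classifier: one mean vector per label."""
    sums, counts = {}, Counter()
    for features, label in subset:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_one(centroids, features):
    """Return the label whose centroid is nearest in squared Euclidean distance."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=dist)


def combine(models, features):
    """Combine component predictions by majority vote (assumed combination rule)."""
    votes = Counter(predict_one(m, features) for m in models)
    return votes.most_common(1)[0][0]


def train_parallel(samples, n_parts=3):
    """Train one component classifier per subset, each in its own worker."""
    subsets = decompose(samples, n_parts)
    # A thread pool stands in for the parallel/distributed infrastructure.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        return list(pool.map(train_centroid_classifier, subsets))


# Toy usage: two well-separated classes, three component classifiers.
toy = [((0.0, 0.1), "a"), ((0.2, 0.0), "a"), ((0.1, 0.2), "a"),
       ((5.0, 5.1), "b"), ((5.2, 4.9), "b"), ((4.9, 5.0), "b")]
toy_models = train_parallel(toy, n_parts=3)
toy_label = combine(toy_models, (0.1, 0.1))
```

Because each subset is trained independently, the component classifier can be swapped for any other learner, and the decomposition step is the natural place to encode prior knowledge about how the data should be grouped.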