Web Page Classification Based on Uncorrelated Semi-Supervised Intra-View and Inter-View Manifold Discriminant Feature Extraction

Xiao-Yuan Jing, Qian Liu, Fei Wu, Baowen Xu, Yangping Zhu, Songcan Chen
2015-01-01
Abstract: Web page classification has attracted increasing research interest. It is intrinsically a multi-view and semi-supervised application, since web pages usually contain two or more types of data, such as text, hyperlinks and images, and unlabeled pages generally far outnumber labeled ones. Web page data is also commonly high-dimensional. Thus, extracting useful features from this kind of data in the multi-view semi-supervised scenario is important for web page classification. To our knowledge, only one method has been presented specifically for this topic, and the few semi-supervised multi-view feature extraction methods developed for other applications still leave much room for improvement. In this paper, we first design a feature extraction schema called semi-supervised intra-view and inter-view manifold discriminant (SI2MD) learning, which fully utilizes the intra-view and inter-view discriminant information of labeled samples and the local neighborhood structures of unlabeled samples. We then design a semi-supervised uncorrelation constraint for the SI2MD schema to remove the multi-view correlation in the semi-supervised scenario. By combining the SI2MD schema with this constraint, we propose an uncorrelated semi-supervised intra-view and inter-view manifold discriminant (USI2MD) learning approach for web page classification. Experiments on public web page databases validate the proposed approach.