Abstract: Many applications require categorizing text documents into predefined categories. The main approach to text categorization is learning from labeled examples. For many tasks, labeled examples may be difficult to find in one language but easy to find in others. The problem of learning from examples in one or more languages and classifying (categorizing) documents in another is called cross-lingual learning. In this work, we present a novel approach that solves the general cross-lingual text categorization problem. Our method generates, for each training document, a set of language-independent features. Training on these features yields a language-independent classifier. At the classification stage, we generate language-independent features for the unlabeled document and apply the classifier to the new representation. To build the feature generator, we utilize a hierarchical language-independent ontology, where each concept has a set of support documents for each language involved. In the preprocessing stage, we use the support documents to build a set of language-independent feature generators, one for each language. The collection of these generators is used to map any document into the language-independent feature space. Our methodology works on the most general cross-lingual text categorization problems, being able to learn from any mix of languages and classify documents in any other language. We also present a method for exploiting the hierarchical structure of the ontology to create virtual supporting documents for languages that do not have them. We tested our method, using Wikipedia as our ontology, on the most commonly used test collections in cross-lingual text categorization, and found that it outperforms existing methods.
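
The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-concept toy ontology, the function names (`build_generator`, `to_concept_space`, `train_centroids`, `classify`), the term-frequency weighting, and the nearest-centroid classifier are all assumptions chosen to keep the example short. The sketch only shows the general idea of mapping documents from different languages into a shared concept space and training on that representation.

```python
import math
from collections import Counter

# Hypothetical toy ontology: concept -> language -> support documents.
# A real system would use a large ontology such as Wikipedia.
ONTOLOGY = {
    "Sports": {"en": ["football match goal team player"],
               "es": ["fútbol partido gol equipo jugador"]},
    "Finance": {"en": ["bank market stock money price"],
                "es": ["banco mercado bolsa dinero precio"]},
}

def tokenize(text):
    return text.lower().split()

def build_generator(ontology, lang):
    """Build one feature generator: concept -> unit-length term vector for `lang`."""
    generator = {}
    for concept, docs_by_lang in ontology.items():
        counts = Counter()
        for doc in docs_by_lang.get(lang, []):
            counts.update(tokenize(doc))
        norm = math.sqrt(sum(c * c for c in counts.values()))
        if norm:
            generator[concept] = {t: c / norm for t, c in counts.items()}
    return generator

def to_concept_space(text, generator, concepts):
    """Map a document to one score per concept (the language-independent features)."""
    counts = Counter(tokenize(text))
    return [sum(counts[t] * w for t, w in generator.get(c, {}).items())
            for c in concepts]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def train_centroids(labeled_vectors):
    """Sum the vectors per label; cosine similarity ignores scale, so sums suffice."""
    sums = {}
    for vec, label in labeled_vectors:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
    return sums

def classify(vec, centroids):
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

CONCEPTS = sorted(ONTOLOGY)
GENERATORS = {lang: build_generator(ONTOLOGY, lang) for lang in ("en", "es")}

# Train on English examples only.
train = [("the team scored a goal in the match", "sports"),
         ("the stock market and bank price", "business")]
vectors = [(to_concept_space(text, GENERATORS["en"], CONCEPTS), label)
           for text, label in train]
centroids = train_centroids(vectors)

# Classify a Spanish document using the Spanish generator.
query = to_concept_space("el equipo ganó el partido con un gol",
                         GENERATORS["es"], CONCEPTS)
prediction = classify(query, centroids)
print(prediction)
```

In this toy setting the classifier never sees a Spanish training example; the Spanish document is comparable to the English ones only because both are expressed as scores over the same concepts.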

Domain and Language Independent Feature Extraction for Statistical Text Categorization

CLDA: Feature Selection for Text Categorization Based on Constrained LDA

Aggressive Dimensionality Reduction With Reinforcement Local Feature Selection For Text Categorization

Text Categorization Based on Domain Ontology

A PCA Based Automatic Image Categorization Approach Using Dominant Color Features

Language Independent Text Categorization

A multiclass classification framework for document categorization

Feature extraction based on principal component analysis for text categorization

Dimensionality Reduction With Category Information Fusion And Non-Negative Matrix Factorization For Text Categorization

Learning Effective Features for Chinese Text Categorization

A General Framework of Feature Selection for Text Categorization

An Effective Feature Selection Method For Text Categorization

Automatic Generation of Language-Independent Features for Cross-Lingual Classification

Distributional Features for Text Categorization

Exploiting Textual and Visual Features for Image Categorization

A New Approach of Feature Selection for Text Categorization

A class-feature-centroid classifier for text categorization

Collaborative Work with Linear Classifier and Extreme Learning Machine for Fast Text Categorization

Scalable Term Selection for Text Categorization

Text Categorization Based on Concept Indexing and Principal Component Analysis

Non-Negative Sparse Semantic Coding for Text Categorization