Abstract: Globalization places people in a multilingual environment. A growing number of users access and share information in several languages for public or private purposes. To deliver relevant information across languages, efficient management of multilingual documents is worth studying. Classification and clustering are the two typical methods for document management. However, the lack of training data and the high effort required for corpus annotation raise the cost of classifying multilingual documents, which must also bridge language gaps. Clustering is therefore better suited to such practical applications. Two main factors are involved in document clustering: the document representation method and the clustering algorithm. In this paper, we focus on the document representation method and demonstrate that the choice of representation affects the quality of clustering results. In our experiments, we use parallel corpora (English-Chinese documents on technology information) and comparable corpora (English and Chinese documents on mobile technology and wind energy) as datasets. We compare four document representation methods: Vector Space Model, Latent Semantic Indexing, Latent Dirichlet Allocation, and Doc2Vec. Experimental results show that the accuracy of the Vector Space Model was not competitive with the other methods across all clustering tasks. Latent Semantic Indexing is overly sensitive to the corpus itself: it behaved differently when clustering the two topics of the comparable corpora. Latent Dirichlet Allocation performs best when clustering documents in small comparable corpora, while Doc2Vec performs best on the large document set of the parallel corpora. Accordingly, the characteristics of a corpus should be taken into account when choosing a document representation method in order to achieve better performance.
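As an illustration of the simplest of the four representations, the following is a minimal sketch of a Vector Space Model with TF-IDF weighting and cosine similarity, written with the Python standard library only. It is not the authors' implementation: the function names `tfidf_vectors` and `cosine`, the weighting formula, and the four toy English documents are assumptions chosen for illustration, with topics borrowed from the comparable corpora described above.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse TF-IDF vector (term -> weight)."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Relative term frequency times log inverse document frequency.
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy documents: two on mobile technology, two on wind energy.
docs = [
    "mobile phone network technology".split(),
    "mobile network operator phone".split(),
    "wind energy turbine power".split(),
    "wind turbine energy farm".split(),
]
vecs = tfidf_vectors(docs)

same_topic = cosine(vecs[0], vecs[1])   # documents sharing several terms
cross_topic = cosine(vecs[0], vecs[2])  # documents sharing no terms
print(same_topic > cross_topic)
```

A clustering algorithm such as k-means would then group documents by these similarities. Because this representation relies on exact term overlap, an English document and a Chinese document share almost no dimensions unless the vocabulary is first mapped across languages, which is one plausible reason a plain Vector Space Model struggles in a bilingual setting.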

Document Representation Methods for Clustering Bilingual Documents
