Abstract: Grouping reports of the same event published in different countries is of significant importance for public opinion control and intelligence gathering. Because news is so diverse, relying solely on human translators would be costly and inefficient, while relying solely on machine translation systems would incur considerable overhead in invoking translation interfaces and storing translated texts. To address this issue, we focus on the problem of clustering cross-lingual news. Specifically, we represent a news article by combining a sentence vector of its headline, embedded in a mixed semantic space, with the topic probability distribution of its content. When training the cross-lingual model, we employ knowledge distillation to fit two semantic spaces into one mixed semantic space. We abandon traditional static clustering methods such as K-Means and AGNES in favor of the incremental clustering algorithm Single-Pass, which we further modify to better suit cross-lingual news clustering. Our main contributions are as follows: (1) We adopt a standard English BERT as the teacher model and XLM-RoBERTa as the student model, and train through knowledge distillation a cross-lingual model that can represent sentence-level bilingual text in both Chinese and English. (2) We use the LDA topic model to represent each news article as a combination of a cross-lingual headline vector and a topic probability distribution over its content, and we introduce concepts such as topic similarity to address the cross-lingual issue in representing news content. (3) We adapt the Single-Pass clustering algorithm to the news context to make it more applicable. Our optimizations of Single-Pass include adjusting the distance computation between samples and clusters, adding a cluster merging operation, and incorporating a news time parameter.
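For readers unfamiliar with Single-Pass, the following is a minimal sketch of the unmodified baseline algorithm, not the adapted version proposed here. It assumes each article has already been reduced to a numeric vector; the function names `cosine` and `single_pass`, the cosine similarity measure, the running-mean centroid, and the threshold of 0.8 are all illustrative assumptions rather than details taken from this work. The modifications listed above (the adjusted sample-to-cluster distance, cluster merging, and the news time parameter) are deliberately left out because the abstract does not specify them.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(vectors, threshold=0.8):
    """Baseline Single-Pass clustering.

    Each vector is visited once, in arrival order. It joins the most similar
    existing cluster if that similarity reaches `threshold`; otherwise it
    opens a new cluster. Returns one cluster label per input vector.
    """
    centroids = []  # running mean vector of each cluster (an assumed choice)
    sizes = []      # number of members in each cluster
    labels = []
    for v in vectors:
        best, best_sim = -1, -1.0
        for i, c in enumerate(centroids):
            sim = cosine(v, c)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            # Fold the new vector into the running mean of the chosen cluster.
            n = sizes[best]
            centroids[best] = [(ci * n + vi) / (n + 1)
                               for ci, vi in zip(centroids[best], v)]
            sizes[best] = n + 1
            labels.append(best)
        else:
            # No cluster is similar enough: start a new one.
            centroids.append(list(v))
            sizes.append(1)
            labels.append(len(centroids) - 1)
    return labels


# Toy stand-ins for article vectors: two near the x-axis, two near the y-axis.
articles = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(single_pass(articles, threshold=0.8))  # [0, 0, 1, 1]
```

Because every article is compared only against the clusters that exist when it arrives, the result depends on arrival order and the number of clusters need not be fixed in advance, which is what distinguishes this incremental scheme from static methods such as K-Means.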
