Abstract: Identifying reports of the same event published in different countries is important for public opinion control and intelligence gathering. Because news is so diverse, relying solely on human translators is costly and inefficient, while relying solely on machine translation systems incurs considerable overhead from invoking translation interfaces and storing translated texts. To address this issue, we focus on the problem of clustering cross-lingual news. Specifically, we represent a news article by combining the sentence vector of its headline in a mixed semantic space with the topic probability distribution of its content. To train the cross-lingual model, we use knowledge distillation to fit two semantic spaces into one mixed semantic space. In place of traditional static clustering methods such as K-Means and AGNES, we adopt the incremental clustering algorithm Single-Pass, which we further modify to better suit cross-lingual news clustering. Our main contributions are as follows: (1) We adopt the standard English BERT as the teacher model and XLM-RoBERTa as the student model, and train through knowledge distillation a cross-lingual model that can represent sentence-level text in both Chinese and English. (2) We use the LDA topic model to represent a news article as a combination of a cross-lingual headline vector and a topic probability distribution over its content, and introduce concepts such as topic similarity to address the cross-lingual issue in news content representation. (3) We adapt the Single-Pass clustering algorithm to the news setting: we adjust how the distance between a sample and a cluster is computed, add a cluster merging operation, and incorporate a news time parameter.
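
To make the clustering idea concrete, the following is a minimal sketch of a time-aware Single-Pass loop over articles represented by a headline vector plus a topic distribution. The abstract does not give the actual formulas, so everything specific here is an assumption made for illustration: the names `Article`, `similarity`, `cluster_similarity`, and `single_pass`, the weighting parameter `alpha`, the use of cosine similarity for both components, the mean-link comparison between a sample and a cluster, and the `max_gap_days` time window. The cluster merging operation is left out for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class Article:
    title_vec: list   # headline embedding (assumed to come from the cross-lingual model)
    topic_dist: list  # topic probability distribution of the content (e.g. from LDA)
    day: int          # publication time, in days (hypothetical unit)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def similarity(a, b, alpha=0.5):
    """Weighted mix of headline similarity and topic similarity (assumed form)."""
    return (alpha * cosine(a.title_vec, b.title_vec)
            + (1 - alpha) * cosine(a.topic_dist, b.topic_dist))


def cluster_similarity(article, cluster, alpha=0.5):
    """Mean similarity between an article and every member of a cluster (assumption)."""
    return sum(similarity(article, m, alpha) for m in cluster) / len(cluster)


def single_pass(articles, threshold=0.8, max_gap_days=3, alpha=0.5):
    """Assign each article, in time order, to the best matching recent cluster."""
    clusters = []
    for art in sorted(articles, key=lambda a: a.day):
        best, best_sim = None, threshold
        for cluster in clusters:
            # Time parameter: skip clusters whose latest article is too old.
            if art.day - cluster[-1].day > max_gap_days:
                continue
            sim = cluster_similarity(art, cluster, alpha)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([art])   # no cluster is close enough: start a new one
        else:
            best.append(art)
    return clusters
```

Because articles are processed in chronological order, the last member of each cluster is always its most recent one, which keeps the time check to a single comparison. Comparing a sample against all members rather than only a cluster's first article is one plausible reading of "adjusting the distance between samples and clusters", not necessarily the authors' choice.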
