Abstract: We present LDA*, a system that has been deployed in one of the largest Internet companies to fulfill its requirement of "topic modeling as an internal service": relying on thousands of machines, engineers in different sectors submit their data, some of it as large as 1.8 TB, to LDA* and get results back in hours. LDA* is motivated by the observation that none of the existing topic modeling systems is robust enough. Each of these systems is designed for a specific point in the tradeoff space and can be sub-optimal, sometimes by up to 10×, across workloads.

Our first contribution is a systematic study of all recently proposed samplers: AliasLDA, F+LDA, LightLDA, and WarpLDA. We discovered a novel system tradeoff among these samplers. Each sampler has a different sampling complexity and performs differently, sometimes by 5×, on documents of different lengths. Based on this tradeoff, we developed a hybrid sampler that uses different samplers for different types of documents. This hybrid approach works across a wide range of workloads and outperforms the fastest sampler by up to 2×. We then focused on distributed environments in which thousands of workers, each with different performance (due to virtualization and resource sharing), coordinate to train a topic model. Our second contribution is an asymmetric parameter server architecture that pushes some computation to the parameter server side. This architecture is motivated by the skew of the word frequency distribution and a novel tradeoff we discovered between communication and computation. With this architecture, we outperform the traditional, symmetric architecture by up to 2×.

With these two contributions, together with a carefully engineered implementation, our system outperforms existing systems by up to 10× and has been providing topic modeling services in production for more than six months.

Industrial Federated Topic Modeling

Familia: An Open-Source Toolkit for Industrial Topic Modeling

Familia: A Configurable Topic Modeling Framework for Industrial Text Engineering

Federated Topic Discovery: A Semantic Consistent Approach

Efficient Ring-topology Decentralized Federated Learning with Deep Generative Models for Industrial Artificial Intelligent

Swarm Learning for Secure and Effective Industrial Federated Big Data Analytics

Federated Neural Topic Models

Federated Tensor Decomposition-Based Feature Extraction Approach for Industrial IoT

Federated Multidiscriminators Multigenerators for Heterogeneous Industrial IoT

AFL: An Adaptively Federated Multi-Task Learning for Model Sharing in Industrial IoT

Peacock: Learning Long-Tail Topic Features for Industrial Applications

A Survey on Federated Learning and its Applications for Accelerating Industrial Internet of Things

Federated Learning in Big Model Era: Domain-Specific Multimodal Large Models

Lightweight Industrial Cohorted Federated Learning for Heterogeneous Assets

Domain-specific Topic Model for Knowledge Discovery in Computational and Data-Intensive Scientific Communities

Efficient and Privacy-Enhanced Federated Learning for Industrial Artificial Intelligence

Heterogeneous Latent Topic Discovery for Semantic Text Mining

Topic Modeling over Short Texts by Incorporating Word Embeddings

LDA*: A Robust and Large-scale Topic Modeling System

Adaptive Federated Learning and Digital Twin for Industrial Internet of Things