Abstract: Databases with "Big Dimensionality" are increasingly common. Existing approaches that compute pairwise feature correlations as part of their algorithmic design perform poorly on such databases, because computing the full correlation matrix (whose size is the square of the dimensionality) is computationally very intensive: a million features translate to a trillion correlations. This challenge has received comparatively little attention in machine learning and data mining research, and this paper presents a study to fill the gap. Our findings on several established databases with big dimensionality, drawn from a wide spectrum of domains, indicate that an extremely small portion of the feature pairs contributes significantly to the underlying interactions and that there exist feature groups that are highly correlated. Inspired by these observations, we introduce a novel learning approach that exploits the presence of sparse correlations to efficiently identify informative and correlated feature groups in big dimensional data, reducing the complexity from O(m^2 n) to O(m log m + K_a mn), where K_a << min(m, n) generally holds. In particular, the proposed approach explicitly incorporates linear and nonlinear correlation measures as constraints in the learning model. One highlight of the approach is an efficient embedded feature selection strategy, designed to filter out the large number of non-contributing correlations that could otherwise confuse the classifier while identifying the correlated and informative feature groups. We also demonstrate the proposed method on one-class learning, where a notable speedup is observed when solving one-class problems on big dimensional data.
Further, to identify robust informative features with minimal sampling bias, our feature selection strategy embeds V-fold cross-validation in the learning model, so as to seek features that exhibit stable or consistent accuracy across multiple data folds. Extensive empirical studies on synthetic data and several real-world datasets comprising up to 30 million dimensions are then conducted to assess and showcase the efficacy of the proposed approach.
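
To make the scale of the stated complexity reduction concrete, the two cost expressions from the abstract can be compared numerically. The following is only a rough sketch: it assumes m denotes the number of features and n the number of samples (the abstract does not define them), it ignores constant factors, and the function names and example values (n = 1,000, K_a = 10) are hypothetical choices for illustration, not figures from the paper.

```python
import math


def full_correlation_cost(m: int, n: int) -> int:
    """Operation count for all pairwise correlations, O(m^2 n), constants ignored."""
    return m * m * n


def sparse_grouping_cost(m: int, n: int, k_a: int) -> float:
    """Operation count quoted in the abstract, O(m log m + K_a m n), constants ignored."""
    return m * math.log2(m) + k_a * m * n


# Assumed example sizes: one million features, as in the abstract's own example;
# n and k_a are illustrative values chosen so that k_a << min(m, n).
m, n, k_a = 1_000_000, 1_000, 10

pairs = m * m  # entries in the full m x m correlation matrix
speedup = full_correlation_cost(m, n) / sparse_grouping_cost(m, n, k_a)

print(f"pairwise correlations: {pairs:.1e}")
print(f"estimated cost ratio (full / sparse): {speedup:.1e}")
```

Under these assumed sizes the K_a mn term dominates the m log m term, so the cost ratio is roughly m / K_a, which is why the condition K_a << min(m, n) matters for the claimed reduction.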

Low-Latency ML Inference by Grouping Correlated Data Objects and Computation

Machine Learning Systems for Highly-Distributed and Rapidly-Growing Data

Efficient Multi-stage Inference on Tabular Data

Hierarchical and Distributed Machine Learning Inference Beyond the Edge

Towards Real-Time Inference Offloading with Distributed Edge Computing: the Framework and Algorithms

Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU

Responsive ML inference in multi-tenanted environments using AQUA

Efficient Evaluation of Composite Correlations for Streaming Time Series

Optimizing Machine Learning Inference Queries with Correlative Proxy Models

CoLLM: A Collaborative LLM Inference Framework for Resource-Constrained Devices

Efficient Digital Twin Data Processing for Low-Latency Multicast Short Video Streaming

Lightweight Distributed Gaussian Process Regression for Online Machine Learning

Automatic Task Parallelization of Dataflow Graphs in ML/DL models

A Fine-Grained End-to-End Latency Optimization Framework for Wireless Collaborative Inference

Making Trillion Correlations Feasible in Feature Grouping and Selection

Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference

Asynchronous Local Computations in Distributed Bayesian Learning

Scalable marginalization of correlated latent variables with applications to learning particle interaction kernels

Low-Rate Feature Compression for Collaborative Intelligence: Reducing Redundancy in Spatial and Statistical Levels

Efficient Communication Scheduling for Parameter Synchronization of DML in Data Center Networks

AdaFlow: Opportunistic Inference on Asynchronous Mobile Data with Generalized Affinity Control