Abstract: With the advent of artificial intelligence (AI) and machine learning (ML), various science and engineering communities have leveraged data-driven surrogates to model complex systems from numerous sources of information (data). This proliferation has significantly reduced the cost and time involved in developing superior systems designed to perform specific functions. A high proportion of such surrogates are built by extensively fusing multiple sources of data, be it published papers, patents, open repositories, or other resources. However, little attention has been paid to differences among the information sources in quality and in the comprehensiveness of their known and unknown underlying physical parameters, differences that could have downstream implications during system optimization. To address this issue, a multi-source data fusion framework based on the Latent Variable Gaussian Process (LVGP) is proposed. The individual data sources are tagged with a characteristic categorical variable that is mapped into a physically interpretable latent space, allowing the development of source-aware data fusion models. Additionally, a dissimilarity metric based on the latent variables of LVGP is introduced to study and understand the differences among the data sources. The proposed approach is demonstrated on and analyzed through two mathematical case studies (a representative parabola problem and the 2D Ackley function) and two materials science case studies (the design of FeCrAl and SmCoFe alloys). The case studies show that, compared to single-source and source-unaware ML models, the proposed multi-source data fusion framework can provide better predictions for sparse-data problems, interpretability regarding the sources, and enhanced modeling capabilities by taking advantage of the correlations and relationships among different sources.
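
To make the source-tagging idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a Gaussian correlation function in which the squared distance between source latent vectors is added to the usual distance between numerical inputs, and it assumes the dissimilarity between two sources is the Euclidean distance between their latent vectors. The source names, latent coordinates, and the `phi` scale parameter are all hypothetical; in LVGP the latent positions would be estimated from data rather than set by hand.

```python
import numpy as np

# Hypothetical 2D latent positions for three data sources. In LVGP these
# would be estimated from data (e.g. by maximum likelihood), not hand-set.
latent = {
    "source_A": np.array([0.0, 0.0]),
    "source_B": np.array([0.1, 0.0]),
    "source_C": np.array([1.5, 0.8]),
}

def correlation(x1, s1, x2, s2, phi=1.0):
    """Assumed Gaussian correlation over numerical inputs and source latents."""
    dx = np.sum(phi * (np.asarray(x1, float) - np.asarray(x2, float)) ** 2)
    dz = np.sum((latent[s1] - latent[s2]) ** 2)
    return float(np.exp(-dx - dz))

def dissimilarity(s1, s2):
    """Assumed metric: Euclidean distance between two sources' latent vectors."""
    return float(np.linalg.norm(latent[s1] - latent[s2]))

# Sources that sit close together in the latent space stay highly correlated
# at the same input; distant sources are treated as nearly unrelated.
near = correlation([0.5], "source_A", [0.5], "source_B")
far = correlation([0.5], "source_A", [0.5], "source_C")
```

Under these assumptions, `near` is larger than `far`, which is the mechanism by which a source-aware model can borrow strength from similar sources while discounting dissimilar ones.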

Minimally-Supervised Attribute Fusion for Data Lakes

Multi-view Heterogeneous Fusion and Embedding for Categorical Attributes on Mixed Data

A data-level fusion model for unsupervised attribute selection in multi-source homogeneous data

Visual Feature Fusion and its Application to Support Unsupervised Clustering Tasks

Data-fusion using factor analysis and low-rank matrix completion

Interpretable Multi-Source Data Fusion Through Latent Variable Gaussian Process

Nonparametric fusion learning: synthesize inferences from diverse sources using depth confidence distribution

Searching Data Lakes for Nested and Joined Data

Self-supervised data lakes discovery through unsupervised metadata-driven weighted similarity

Data Fusion: Resolving Conflicts from Multiple Sources

FREYJA: Efficient Join Discovery in Data Lakes

A multi-scale information fusion-based multiple correlations for unsupervised attribute selection

Exploration of Data Fusion Strategies Using Principal Component Analysis and Multiple Factor Analysis

Unsupervised Feature Selection Via Metric Fusion and Novel Low-Rank Approximation

A Heterogeneous Multi-Modal Medical Data Fusion Framework Supporting Hybrid Data Exploration

Unsupervised Data Fusion With Deeper Perspective: A Novel Multisensor Deep Clustering Algorithm

From Data Fusion to Knowledge Fusion

A survey on machine learning for data fusion

Deep Neural Network Fusion via Graph Matching with Applications to Model Ensemble and Federated Learning

A multi-source heterogeneous spatial big data fusion method based on multiple similarity and voting decision

Scalable data fusion via a scale-based hierarchical framework: Adapting to multi-source and multi-scale scenarios