Abstract: Background: The amount of data in health care is rising rapidly, and multiple datasets are now generated for any given individual. Data integration involves mapping variables in different datasets together to form a combined dataset, which can then be used for different types of analyses. However, as the number of variables grows, manually mapping a dataset can become inefficient. An alternative is to use machine learning text classification to classify the variables to a schema. Objectives: Our aim was to create and evaluate machine learning methods for integrating datasets across health information-seeking behavior (HISB) databases. Methods: Four online databases relevant to the research field were selected for integration. Two experiments were designed for dataset mapping: intra-database mapping, which uses a single data source, and inter-database mapping, which maps datasets between the four databases. We compared logistic regression (LR), random forest classifier (RFC), and neural network (NN) models by F1 score for the two methods of integration. A third experiment was an ablation study that used all the available data to create a model for classifying HISB variables in a dataset. Results: In intra-database mapping, the mean F1 score for the LR classifier (0.787) was better than that of the RFC (0.767) and the fully connected NN (0.735). In inter-database mapping, LR scored best (0.245); however, this depended on which database was used as the training source. Using all the databases, the top three models were able to correctly classify 90-91% of the variables. Removing one dataset improved scores and resulted in a model able to correctly classify 95-96% of the HISB variables. Conclusions: As part of data integration, a neural network can be used as an approach to map the variables of a dataset. The developed models can be used to classify the HISB terms in a database.
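The abstract ranks the classifiers by F1 score but does not say how the score was averaged. As a minimal sketch of such a comparison, the Python below computes per-class F1 and its unweighted (macro) mean for two sets of predictions. The macro averaging, the schema category names, and the labels for `model_a` and `model_b` are all assumptions made for illustration; none of them come from the study.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1} for paired true and predicted schema labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging, not the paper's)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Hypothetical schema categories assigned to six dataset variables.
y_true = ["demographics", "demographics", "internet_use",
          "internet_use", "health_status", "health_status"]
# Hypothetical predictions from two candidate classifiers.
model_a = ["demographics", "demographics", "internet_use",
           "health_status", "health_status", "health_status"]
model_b = ["demographics", "internet_use", "internet_use",
           "health_status", "health_status", "demographics"]

for name, pred in [("model_a", model_a), ("model_b", model_b)]:
    print(f"{name}: macro F1 = {macro_f1(y_true, pred):.3f}")
```

Macro averaging weights every schema category equally, so rare categories count as much as common ones. A support-weighted average would be an equally plausible choice when category sizes differ widely.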

Human-in-the-loop Data Integration

Research of Entity Matching Based on Multiple Heterogenous Data

Technical Report: Optimizing Human Involvement for Entity Matching and Consolidation.

CrowdER: crowdsourcing entity resolution

A Hybrid Machine-Crowdsourcing Approach For Web Table Matching And Cleaning

Incremental and Interactive Data Integration Approach for Hierarchical Data in Domain of Intelligent Livelihood

Hike: A Hybrid Human-Machine Method for Entity Alignment in Large-Scale Knowledge Bases.

Crowdsourcing Database Systems: Overview and Challenges

Business Cooperation-oriented Heterogeneous System Integration Framework and its Implementation.

A Unified Approach to Matching Semantic Data on the Web

The Interaction Between Schema Matching and Record Matching in Data Integration

Healthcare data integration using machine learning: A case study evaluation with health information-seeking behavior databases

HISMA - A Human-Machine Iterative Schema Matching Algorithm.

Contextual Crowd Intelligence.

Smartint: A Demonstration System For The Interaction Between Schema Mapping And Record Matching

Research of Matching Technology in Data Integration

Amalur: Data Integration Meets Machine Learning

A Survey of Human-in-the-loop for Machine Learning

Crowdsourced Data Management: A Survey.

A Task-Interdependency Model of Complex Collaboration Towards Human-Centered Crowd Work

Crowd-Powered Data Mining