Healthcare data integration using machine learning: A case study evaluation with health information-seeking behavior databases

Ardalan Mirzaei, Parisa Aslani, Carl R Schneider
DOI: https://doi.org/10.1016/j.sapharm.2022.08.001
Abstract: Background: The amount of data in health care is rising rapidly, and multiple datasets are now generated for any given individual. Data integration involves mapping the variables of different datasets to one another to form a combined dataset, which can then be used for different types of analyses. However, as the number of variables increases, manual mapping of a dataset can become inefficient. An alternative approach is to use machine learning text classification to assign the variables to a schema. Objectives: Our aim was to create and evaluate machine learning methods for integrating data from datasets across health information-seeking behavior (HISB) databases. Methods: Four online databases relevant to the research field were selected for integration. Two experiments were designed for dataset mapping: intra-database mapping, using a single data source, and inter-database mapping, to map datasets between the four databases. We compared logistic regression (LR), random forest classifier (RFC), and neural network (NN) models by F1 score for the two methods of integration. A third experiment was an ablation study that used all the available data to create a model for classifying HISB variables in a dataset. Results: In intra-database mapping, the mean F1 score for the LR classifier (0.787) was higher than that of the RFC (0.767) and the fully connected NN (0.735). In inter-database mapping, LR scored best (0.245); however, this depended on which database was used as the training source. Using all the databases, the three models were able to correctly classify 90-91% of the variables. Removing one dataset improved scores and resulted in a model able to correctly classify 95-96% of the HISB variables. Conclusions: As part of data integration, a neural network can be used as an approach to map the variables of a dataset. The developed models can be used to classify the HISB terms in a database.
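The general idea of classifying variable labels to a schema and scoring the result by F1 can be illustrated with a minimal sketch. This is not the authors' pipeline: a simple token-overlap classifier is swapped in for the LR, RFC, and NN models so that the example needs only the Python standard library. The schema categories, variable labels, and function names below are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(label):
    """Lowercase a variable label and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", label.lower())


def train(examples):
    """Count tokens per schema category from (label, category) pairs."""
    counts = defaultdict(Counter)
    for label, category in examples:
        counts[category].update(tokenize(label))
    return counts


def predict(counts, label):
    """Pick the category whose training tokens overlap most with the label.

    Ties are broken alphabetically by category name.
    """
    tokens = tokenize(label)
    scores = {cat: sum(c[t] for t in tokens) for cat, c in counts.items()}
    return max(sorted(scores), key=scores.get)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for lab in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == lab and p == lab for t, p in pairs)
        fp = sum(t != lab and p == lab for t, p in pairs)
        fn = sum(t == lab and p != lab for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical variable labels from a "training" database, already mapped
# by hand to hypothetical schema categories.
train_set = [
    ("Used the internet to look for health information", "information_source"),
    ("Asked a doctor or nurse for medical advice", "information_source"),
    ("Age of respondent in years", "demographics"),
    ("Highest level of education completed", "demographics"),
    ("Self-rated general health", "health_status"),
    ("Ever diagnosed with diabetes", "health_status"),
]

# Hypothetical labels from a second database, with their true categories.
test_set = [
    ("Searched the internet for medical information", "information_source"),
    ("Respondent age group", "demographics"),
    ("Diagnosed with high blood pressure", "health_status"),
    ("Education level of respondent", "demographics"),
]

model = train(train_set)
y_true = [category for _, category in test_set]
y_pred = [predict(model, label) for label, _ in test_set]
score = macro_f1(y_true, y_pred)
print(f"macro F1 on the toy inter-database split: {score:.3f}")
```

Training on one source and testing on another mirrors the inter-database setting described in the abstract, where the score depended on which database served as the training source. Macro averaging is used here as one common choice; the abstract does not state which averaging the reported F1 scores use.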