HoneyBee: A Scalable Modular Framework for Creating Multimodal Oncology Datasets with Foundational Embedding Models

Aakash Tripathi, Asim Waqas, Yasin Yilmaz, Ghulam Rasool
2024-06-14
Abstract: Developing accurate machine learning models for oncology requires large-scale, high-quality multimodal datasets. However, creating such datasets remains challenging due to the complexity and heterogeneity of medical data. To address this challenge, we introduce HoneyBee, a scalable modular framework for building multimodal oncology datasets that leverages foundation models to generate representative embeddings. HoneyBee integrates various data modalities, including clinical diagnostic and pathology imaging data, medical notes, reports, records, and molecular data. It employs data preprocessing techniques and foundation models to generate embeddings that capture the essential features and relationships within the raw medical data. The generated embeddings are stored in a structured format using Hugging Face datasets and PyTorch dataloaders for accessibility. Vector databases enable efficient querying and retrieval for machine learning applications. We demonstrate the effectiveness of HoneyBee through experiments assessing the quality and representativeness of these embeddings. The framework is designed to be extensible to other medical domains and aims to accelerate oncology research by providing high-quality, machine learning-ready datasets. HoneyBee is an ongoing open-source effort, and the code, datasets, and models are available at the project repository.
Machine Learning, Artificial Intelligence, Databases
What problem does this paper attempt to address?
This paper addresses the challenge of constructing large-scale, high-quality multimodal oncology datasets. It identifies three obstacles to developing accurate machine learning models in oncology research:

1. **Need for large-scale, high-quality multimodal datasets**: Such datasets should combine multiple data types, such as clinical diagnostic data, pathology imaging data, medical records, reports, and molecular data. Creating them is difficult because medical data is complex and heterogeneous.
2. **Complexity of data integration and preprocessing**: Raw medical data from different sources (such as cancer research data centers, genomic data repositories, proteomic data repositories, and imaging data repositories) usually come in different formats and are subject to privacy constraints. Integrating these scattered sources requires substantial manual work, including data alignment, quality control, and metadata management.
3. **Lack of an efficient multimodal data-processing framework**: Existing publicly available multimodal oncology datasets vary widely in scale and quality and do not meet the needs of developing robust machine learning models. A framework is therefore needed that can efficiently aggregate and preprocess scattered public medical data to generate feature representations suitable for machine learning applications.

To address these problems, the authors propose HoneyBee, a modular and extensible framework for constructing multimodal oncology datasets.
HoneyBee achieves its goals in the following ways:

- **Standardized preprocessing pipeline**: A set of standardized preprocessing procedures for different data modalities (such as clinical records, imaging data, genomic information, and patient outcomes) ensures data consistency and reproducibility.
- **Rich embedding generation**: Pretrained foundation models extract feature-rich embeddings from the raw medical data, capturing complex patterns and relationships in the data.
- **Framework evaluation**: Experiments on large-scale oncology datasets assess the quality and representativeness of the generated embeddings and verify the effectiveness of the HoneyBee framework.

HoneyBee is designed to accelerate oncology research by providing high-quality, machine learning-ready datasets that support oncology applications such as cancer screening, diagnosis, prognosis prediction, treatment response assessment, and postoperative monitoring.

### Formula Explanation

The paper is mainly descriptive and involves few formulas.
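
### Illustrative Sketch

The embed, store, and retrieve flow described in the abstract can be illustrated with a minimal sketch. This is not HoneyBee's code or API: `toy_embed` is a hypothetical stand-in for a foundation model (it only hashes tokens into buckets), `InMemoryIndex` is a hypothetical stand-in for a vector database, and the patient IDs and texts are made up for illustration.

```python
from __future__ import annotations

import hashlib
import math

DIM = 64  # assumed toy embedding size; real foundation models emit far larger vectors


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a foundation model: hash tokens into buckets."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # L2-normalize so that a dot product equals cosine similarity.
    return [v / norm for v in vec] if norm else vec


class InMemoryIndex:
    """Hypothetical stand-in for a vector database: brute-force cosine search."""

    def __init__(self) -> None:
        self.records: list[dict] = []

    def add(self, patient_id: str, modality: str, text: str) -> None:
        # Store the embedding next to the metadata needed to trace it back.
        self.records.append(
            {"patient_id": patient_id, "modality": modality, "embedding": toy_embed(text)}
        )

    def query(self, text: str, k: int = 1) -> list[tuple[float, str, str]]:
        # Score every stored embedding against the query and keep the top k.
        q = toy_embed(text)
        scored = [
            (sum(a * b for a, b in zip(q, r["embedding"])), r["patient_id"], r["modality"])
            for r in self.records
        ]
        return sorted(scored, reverse=True)[:k]


# Made-up records for illustration only.
index = InMemoryIndex()
index.add("P001", "pathology_report", "lung adenocarcinoma stage II")
index.add("P002", "clinical_note", "breast carcinoma grade 3 biopsy")
for score, patient_id, modality in index.query("lung adenocarcinoma stage II", k=2):
    print(f"{patient_id} ({modality}): {score:.3f}")
```

Keeping the patient ID and modality alongside each vector is the design choice that matters here: it is what lets embeddings from different modalities be joined per patient. In a real pipeline the embedder would be a pretrained model and the index a dedicated vector store.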