Abstract: Developing accurate machine learning models for oncology requires large-scale, high-quality multimodal datasets. However, creating such datasets remains challenging due to the complexity and heterogeneity of medical data. To address this challenge, we introduce HoneyBee, a scalable modular framework for building multimodal oncology datasets that leverages foundation models to generate representative embeddings. HoneyBee integrates various data modalities, including clinical diagnostic and pathology imaging data, medical notes, reports, records, and molecular data. It employs data preprocessing techniques and foundation models to generate embeddings that capture the essential features and relationships within the raw medical data. The generated embeddings are stored in a structured format using Hugging Face datasets and PyTorch dataloaders for accessibility. Vector databases enable efficient querying and retrieval for machine learning applications. We demonstrate the effectiveness of HoneyBee through experiments assessing the quality and representativeness of these embeddings. The framework is designed to be extensible to other medical domains and aims to accelerate oncology research by providing high-quality, machine learning-ready datasets. HoneyBee is an ongoing open-source effort, and the code, datasets, and models are available at the project repository.

What problem does this paper attempt to address?

This paper addresses the challenge of constructing large-scale, high-quality multimodal oncology datasets. Specifically, it identifies three obstacles to developing accurate machine-learning models in current oncology research:

1. **Need for large-scale, high-quality multimodal datasets**: These datasets should contain multiple types of data, such as clinical diagnosis data, pathology imaging data, medical records, reports, and molecular data. Creating such datasets is difficult because medical data is complex and heterogeneous.
2. **Complexity of data integration and preprocessing**: Raw medical data from different sources (such as cancer research data centers, genomic data repositories, proteomic data repositories, and imaging data repositories) usually comes in different formats and is subject to privacy constraints. Integrating these scattered sources requires a great deal of manual work, including data alignment, quality control, and metadata management.
3. **Lack of an efficient multimodal data-processing framework**: Existing publicly available multimodal oncology datasets vary widely in scale and quality and cannot meet the needs of developing robust machine-learning models. A framework is therefore needed that can efficiently aggregate and preprocess scattered public medical data into feature representations suitable for machine-learning applications.

To solve these problems, the authors propose the HoneyBee framework, a modular and extensible platform for constructing multimodal oncology datasets.
HoneyBee achieves its goals in the following ways:

- **Standardized preprocessing pipeline**: A set of standardized preprocessing procedures is provided for the different data modalities (such as clinical records, imaging data, genomic information, and patient outcomes) to ensure data consistency and reproducibility.
- **Rich embedding vectors**: Pretrained foundation models extract feature-rich embedding vectors from the raw medical data, capturing complex patterns and relationships in the data.
- **Evaluation of the framework**: Experiments on large-scale oncology datasets assess the quality and representativeness of the generated embedding vectors and verify the effectiveness of the HoneyBee framework.

The design goal of the HoneyBee framework is to accelerate oncology research by providing high-quality, machine-learning-ready datasets that support oncology applications such as cancer screening, diagnosis, prognosis prediction, treatment response assessment, and postoperative monitoring.

The paper itself is mainly descriptive and involves few formulas.
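
The workflow summarized above (modality-specific preprocessing, embedding generation, storage, and similarity-based retrieval) can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration and is not HoneyBee's actual API: the `PIPELINES` registry, the function names, the hashed bag-of-words function that stands in for a real foundation model, the in-memory list that stands in for a vector database, and the made-up patient records.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def l2_normalize(vec: Vector) -> Vector:
    """Scale a vector to unit length (all-zero vectors are returned unscaled)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def preprocess_text(raw: str) -> str:
    """Lowercase and collapse whitespace so equivalent notes compare equal."""
    return " ".join(raw.lower().split())


def embed_text(text: str, dim: int = 8) -> Vector:
    """Toy stand-in for a foundation model: hashed bag-of-words, unit length."""
    vec = [0.0] * dim
    for token in text.split():
        # Deterministic bucket index (built-in hash() of str varies per process).
        vec[sum(ord(c) for c in token) % dim] += 1.0
    return l2_normalize(vec)


def preprocess_expression(values: List[float]) -> List[float]:
    """Log-transform non-negative expression values to compress their range."""
    return [math.log1p(v) for v in values]


def embed_expression(values: List[float]) -> Vector:
    """Toy stand-in for a molecular encoder: the normalized values themselves."""
    return l2_normalize(values)


# Hypothetical registry: one (preprocess, embed) pair per modality, so a new
# modality can be added without touching the rest of the pipeline.
PIPELINES: Dict[str, Tuple[Callable, Callable]] = {
    "clinical_text": (preprocess_text, embed_text),
    "expression": (preprocess_expression, embed_expression),
}


def build_store(raw_records: List[Tuple[str, str, object]]) -> List[dict]:
    """Run each raw record through its modality pipeline and keep the embedding."""
    store = []
    for patient_id, modality, raw in raw_records:
        preprocess, embed = PIPELINES[modality]
        store.append({
            "patient_id": patient_id,
            "modality": modality,
            "embedding": embed(preprocess(raw)),
        })
    return store


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def query(store: List[dict], modality: str, raw_query, k: int = 1):
    """Return the k most similar stored records of the same modality."""
    preprocess, embed = PIPELINES[modality]
    q = embed(preprocess(raw_query))
    scored = [
        (cosine(q, rec["embedding"]), rec["patient_id"])
        for rec in store
        if rec["modality"] == modality
    ]
    return sorted(scored, reverse=True)[:k]


# Made-up records for illustration only.
RAW_RECORDS = [
    ("P1", "clinical_text", "  Invasive ductal carcinoma, grade 2 "),
    ("P2", "clinical_text", "Lung adenocarcinoma stage III"),
    ("P1", "expression", [1.0, 4.0, 0.5]),
    ("P2", "expression", [3.0, 0.2, 2.0]),
]

store = build_store(RAW_RECORDS)
print(query(store, "clinical_text", "invasive ductal carcinoma, grade 2"))
```

Keeping one preprocess/embed pair per modality mirrors the modular design the summary describes. In a real setting the toy embedders would presumably be replaced by pretrained models and the list by a vector database.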

HoneyBee: A Scalable Modular Framework for Creating Multimodal Oncology Datasets with Foundational Embedding Models

Bee Together: Joining Bee Audio Datasets for Hive Extrapolation in AI-Based Monitoring

Multimodal CustOmics: A Unified and Interpretable Multi-Task Deep Learning Framework for Multimodal Integrative Data Analysis in Oncology

Building Flexible, Scalable, and Machine Learning-ready Multimodal Oncology Datasets

Embedding-based Multimodal Learning on Pan-Squamous Cell Carcinomas for Improved Survival Outcomes

HEALNet: Multimodal Fusion for Heterogeneous Biomedical Data

Integration of Domain Knowledge using Medical Knowledge Graph Deep Learning for Cancer Phenotyping

A Framework for Implementing Machine Learning on Omics Data

Multimodal Data Integration for Oncology in the Era of Deep Neural Networks: A Review

Stone Needle: A General Multimodal Large-scale Model Framework towards Healthcare

DinoBloom: A Foundation Model for Generalizable Cell Embeddings in Hematology

A Framework for Evaluating the Efficacy of Foundation Embedding Models in Healthcare

AVBAE-MODFR: A novel deep learning framework of embedding and feature selection on multi-omics data for pan-cancer classification

OligoM-Cancer: A multidimensional information platform for deep phenotyping of heterogenous oligometastatic cancer

A framework for classifying breast cancer via heterogenetic attention mechanism and optimized feature selection

Integrating Heterogeneous Datasets by Using Multimodal Deep Learning

OpenMEDLab: An Open-source Platform for Multi-modality Foundation Models in Medicine

A Scalable Framework for Benchmarking Embedding Models for Semantic Medical Tasks

Enhancing Biomedical Knowledge Discovery for Diseases: An Open-Source Framework Applied on Rett Syndrome and Alzheimer's Disease