Domain adaptation in small-scale and heterogeneous biological datasets

Seyedmehdi Orouji, Martin C. Liu, Tal Korem, Megan A. K. Peters

2024-05-30

Abstract: Machine learning techniques are steadily becoming more important in modern biology, and are used to build predictive models, discover patterns, and investigate biological problems. However, models trained on one dataset are often not generalizable to other datasets from different cohorts or laboratories, due to differences in the statistical properties of these datasets. These could stem from technical differences, such as the measurement technique used, or from relevant biological differences between the populations studied. Domain adaptation, a type of transfer learning, can alleviate this problem by aligning the statistical distributions of features and samples among different datasets so that similar models can be applied across them. However, a majority of state-of-the-art domain adaptation methods are designed to work with large-scale data, mostly text and images, while biological datasets often suffer from small sample sizes, and possess complexities such as heterogeneity of the feature space. This Review aims to synthetically discuss domain adaptation methods in the context of small-scale and highly heterogeneous biological data. We describe the benefits and challenges of domain adaptation in biological research and critically discuss some of its objectives, strengths, and weaknesses through key representative methodologies. We argue for the incorporation of domain adaptation techniques to the computational biologist's toolkit, with further development of customized approaches.

Quantitative Methods, Machine Learning

What problem does this paper attempt to address?

The paper primarily explores the challenges of, and potential solutions for, applying domain adaptation techniques to small-scale and heterogeneous biological datasets. Specifically, it addresses the following core issues:

1. **Challenges Brought by Dataset Characteristics**:
   - Small sample size versus high-dimensional feature space: Biological datasets usually have few samples but many features, which can easily lead to model overfitting.
   - Missing values: Missing values are common in biological data, for example the zero inflation seen in microbiome data.
   - Feature heterogeneity: The number and arrangement of features vary between datasets. In fMRI data, for example, the number of voxels and their functional alignment differ from one individual to the next.
2. **Application Limitations of Domain Adaptation Techniques**:
   - Most existing domain adaptation methods are designed with large-scale datasets (such as images and text) in mind and are not optimized for small-scale biological datasets.
   - The unique complexity and heterogeneity of biological data make traditional domain adaptation methods difficult to apply directly. For example, differences in preprocessing steps between laboratories can further exacerbate batch effects.
3. **Theoretical Limitations**:
   - The success of domain adaptation depends on the source domain being adaptable to the target domain. If the distributions of the two domains differ too much, negative transfer may occur, where applying source-domain knowledge reduces model performance on the target domain.

In summary, this paper aims to comprehensively discuss how to effectively apply domain adaptation techniques to small-scale and highly heterogeneous biological datasets, highlighting the shortcomings of current methods and future research directions.
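To make the idea of "aligning the statistical distributions of features" across datasets more concrete, here is a minimal sketch in Python with NumPy. It is not a method taken from the paper: it shows only the simplest form of alignment, matching each feature's mean and standard deviation between a source and a target dataset. The function name `align_moments`, the toy data, and the synthetic batch effect (an offset of 2 and a scale of 3) are all assumptions made for illustration.

```python
import numpy as np

def align_moments(source, target, eps=1e-8):
    """Shift and rescale each source feature to match the target's mean and std.

    Hypothetical helper for illustration; it assumes both arrays share the
    same feature columns, which heterogeneous biological data may not.
    """
    src_mean, src_std = source.mean(axis=0), source.std(axis=0)
    tgt_mean, tgt_std = target.mean(axis=0), target.std(axis=0)
    # Standardize with source statistics, then re-express in target units.
    return (source - src_mean) / (src_std + eps) * tgt_std + tgt_mean

rng = np.random.default_rng(0)

# Toy "source laboratory": 20 samples x 5 features (deliberately small).
source = rng.normal(loc=0.0, scale=1.0, size=(20, 5))
# Toy "target laboratory": same structure plus a synthetic batch effect.
target = rng.normal(loc=0.0, scale=1.0, size=(20, 5)) * 3.0 + 2.0

aligned = align_moments(source, target)

# Largest per-feature difference in means, before and after alignment.
mean_gap_before = np.abs(source.mean(axis=0) - target.mean(axis=0)).max()
mean_gap_after = np.abs(aligned.mean(axis=0) - target.mean(axis=0)).max()
print(f"max mean gap before: {mean_gap_before:.3f}, after: {mean_gap_after:.2e}")
```

Matching first and second moments in this way removes only simple offsets and scale differences. With few samples the estimated statistics are themselves noisy, and if the two populations genuinely differ (for example in class proportions), forcing their feature distributions to match can hurt target performance, which is one way the negative transfer described above can arise.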
