Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models

Joseph Lee, Shu Yang, Jae Young Baik, Xiaoxi Liu, Zhen Tan, Dawei Li, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Li Shen
2024-10-03
Abstract: Predicting phenotypes with complex genetic bases based on a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are utilized for this task, yet the high-dimensional nature of genotype data makes the analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set out to examine the ability of LLMs in feature selection and engineering for tabular genotype data, with a novel knowledge-driven framework. We develop FREEFORM, Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling, designed with chain-of-thought and ensembling principles, to select and engineer features with the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, we find this framework outperforms several data-driven methods, particularly in low-shot regimes. FREEFORM is available as an open-source framework on GitHub: https://github.com/PennShenLab/FREEFORM
Machine Learning, Computation and Language, Genomics
### What problem does this paper attempt to address?

This paper addresses the problem of predicting complex phenotypes from genotype data. Specifically, the authors focus on how to select a small, interpretable subset of features from a large number of genetic variants and predict phenotypes accurately on that basis. Traditional data-driven methods face several challenges with high-dimensional genotype data, including overfitting, multicollinearity, and the limited interpretability of feature interaction terms. To this end, the authors propose a knowledge-driven framework based on large language models (LLMs), **FREEFORM**, for feature selection and feature engineering on genotype data.

#### Main problems and challenges

1. **The "curse of dimensionality" in high-dimensional data**:
   - Genotype data usually contains thousands or even millions of features (such as single-nucleotide polymorphisms, SNPs), which makes analysis and prediction difficult.
   - High-dimensional data is prone to overfitting, especially when the sample size is limited.
2. **Complexity of feature selection and feature engineering**:
   - Data-driven methods (such as Lasso regression) perform poorly when few samples are available.
   - Feature engineering is labor-intensive and requires domain expertise to avoid multiple-testing problems.
3. **Balance between interpretability and prediction performance**:
   - Complex models may improve prediction performance but are often difficult to interpret; simple models may fail to capture complex interactions between genes.

#### Solutions

The authors propose a knowledge-driven framework named **FREEFORM**, which uses pre-trained large language models (LLMs) to perform feature selection and feature engineering.
This framework addresses the above problems in the following ways:

- **Feature selection**: Use the knowledge encoded in the LLM to screen for the most informative genetic variants and reduce the feature dimension.
- **Feature engineering**: Generate new features from the selected ones, especially interaction terms that are easier to interpret.
- **Ensemble learning**: Combine multiple iterations and different prompting strategies to improve the robustness and generalization ability of the model.

#### Application scenarios

The authors evaluated the **FREEFORM** framework on two real genotype-phenotype datasets:

1. **Genetic ancestry prediction**: Use data from the 1000 Genomes Project to predict the super-population ancestry of individuals (such as African, American, or East Asian).
2. **Hereditary hearing loss prediction**: Use a smaller genotype dataset to predict whether an individual has hereditary hearing loss.

The experimental results show that **FREEFORM** outperforms traditional data-driven methods when the sample size is small (few-shot), especially on the genetic ancestry prediction task, and can achieve performance similar to LASSO with only a small number of samples.

In conclusion, this paper tackles the problems of high dimensionality, interpretability, and limited sample size in feature selection and feature engineering for genotype data by introducing a knowledge-driven, LLM-based method, offering a new approach to the efficient analysis of genotype data.
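
The selection, engineering, and ensembling ideas summarized above can be illustrated with a small sketch. This is not the FREEFORM implementation: the LLM calls are replaced by hard-coded candidate lists, the variant names (`snp_a`, `snp_b`, ...) are placeholders, the 0/1/2 allele-count encoding is an assumption, and the helper names `ensemble_select` and `add_interactions` are hypothetical. It only shows how several independent selection runs might be merged by vote and how pairwise interaction terms could be derived from the surviving features.

```python
from collections import Counter
from itertools import combinations


def ensemble_select(candidate_lists, k):
    """Keep the k features proposed most often across runs.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(f for run in candidate_lists for f in set(run))
    ranked = sorted(counts, key=lambda f: (-counts[f], f))
    return ranked[:k]


def add_interactions(row, features):
    """Restrict one sample to `features` and add pairwise product terms."""
    out = {f: row[f] for f in features}
    for a, b in combinations(features, 2):
        out[f"{a}*{b}"] = row[a] * row[b]
    return out


# Stand-ins for three independent selection runs (assumed, not real LLM output).
runs = [
    ["snp_a", "snp_b", "snp_c"],
    ["snp_b", "snp_c", "snp_d"],
    ["snp_c", "snp_b", "snp_e"],
]
selected = ensemble_select(runs, k=2)

# One sample with genotypes encoded as allele counts 0/1/2 (assumed encoding).
sample = {"snp_a": 0, "snp_b": 1, "snp_c": 2, "snp_d": 0, "snp_e": 1}
features = add_interactions(sample, selected)
print(selected)
print(features)
```

Voting across runs is one simple way to make the selection less sensitive to any single response; the product of two allele counts is just one possible interaction term, chosen here because it stays easy to read.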