Cell morphological representations of genes enhance prediction of drug targets

Niveditha S. Iyer, Daniel J. Michael, S-Y Gordon Chi, John Arevalo, Srinivas Niranj Chandrasekaran, Anne E. Carpenter, Pranav Rajpurkar, Shantanu Singh
DOI: https://doi.org/10.1101/2024.06.08.598076
2024-06-10
Abstract: Identifying how a given chemical of interest exerts its impact on biological systems is a critical step in developing new medicines and chemical products. The mechanism of a query compound of interest can sometimes be identified when its image-based morphological profile matches a compound in a library of well-annotated compound profiles. In this study, we demonstrate a significant improvement in classification performance by incorporating side information: gene representations. We generate these representations using the morphological profiles of cells where the level of a single gene's expression has been artificially increased or decreased. The genes are selected as those encoding known protein targets of annotated compounds in the library. A transformer model is trained to classify gene-compound pairs, where each pair represents a potential interaction between a gene and a compound, as true or false. Subsequently, the model generates a ranked list of likely target genes for a previously unseen query compound. Although the strategy exhibits high performance only for compounds that target previously encountered genes - likely due to the limited size of our training dataset - the performance increase demonstrates a notable improvement over simply matching compound profiles directly to compound profiles or to gene profiles. Larger datasets may improve the prediction capabilities of this approach, enabling the prediction of gene targets for novel compounds, which can then be experimentally validated.
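The abstract compares the transformer against a simpler baseline: matching a compound's profile directly to gene profiles. A minimal sketch of that baseline idea follows, assuming cosine similarity as the matching score (the abstract does not name the metric). The gene names, the three-dimensional toy profiles, and the helper names `cosine` and `rank_genes` are all hypothetical and only illustrate how a ranked list of candidate target genes could be produced; this is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length profile vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero profile as having no similarity
    return dot / (norm_u * norm_v)


def rank_genes(query_profile, gene_profiles):
    """Return (gene, score) pairs sorted from most to least similar."""
    scored = [
        (gene, cosine(query_profile, profile))
        for gene, profile in gene_profiles.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): real morphological profiles have far more features.
gene_profiles = {
    "GENE_A": [1.0, 0.0, 0.5],
    "GENE_B": [0.0, 1.0, 0.0],
    "GENE_C": [-1.0, 0.2, -0.4],
}
query_compound_profile = [0.9, 0.1, 0.4]

ranking = rank_genes(query_compound_profile, gene_profiles)
for gene, score in ranking:
    print(f"{gene}\t{score:.3f}")
```

The learned model described in the abstract replaces this fixed similarity score with a classifier trained on labeled gene-compound pairs; the ranked-list output has the same shape.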
Bioinformatics
What problem does this paper attempt to address?
The paper aims to address a key issue in the drug discovery process: how to efficiently identify potential protein targets for candidate drugs. Specifically, the research focuses on using cellular morphological phenotype information to enhance the accuracy of drug target prediction.

The research background includes:

- The current cost of drug development is high and continues to rise.
- With advances in genomics, the discovery of new diseases and their subtypes is accelerating, making it difficult for traditional drug screening methods to keep pace.
- Existing target deconvolution techniques are often expensive, time-consuming, and uncertain in their results.

To address these issues, the authors propose a machine learning-based approach that utilizes high-dimensional data generated by cell imaging technologies (such as Cell Painting) to train models to predict the interactions between compounds and specific genes. Specifically, the key innovation of this method lies in the integration of the morphological phenotype information of genes, that is, the cellular morphological changes produced by increasing or decreasing the expression level of a single gene, and using these changes as a feature representation of the genes.

The main contributions of the study include:

1. **Model Design**: The Transformer architecture was adopted, which is a deep learning model widely used in natural language processing for sequence-to-sequence tasks. In this study, the morphological phenotypes of genes are encoded as embeddings, and the features of compounds are also transformed into corresponding embeddings. The task of the model is to infer the potential targets of compounds from the gene embeddings.
2. **Experimental Design**: A dataset containing 302 compounds and 160 genes (CPJUMP1) was used, which was carefully selected to cover a variety of different biological targets. The generalization ability of the model was evaluated by dividing the dataset in different ways (such as leave-one-compound-out, leave-one-gene-out strategies).
3. **Performance Improvement**: Compared to traditional methods that only match compound features, the model significantly improved prediction accuracy, especially in predicting new compounds for known genes. However, the performance of the model declined when predicting new compounds for unknown genes.

In summary, this study demonstrates that integrating gene morphological phenotype information into machine learning models can effectively improve the accuracy of drug target prediction, which is of significant importance for accelerating the drug discovery process. However, to achieve accurate predictions for completely unknown genes, further expansion of the dataset size and exploration of more advanced model architectures and techniques are needed.
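The leave-one-compound-out and leave-one-gene-out evaluations mentioned above can be illustrated with a small sketch. It assumes every compound is paired with every gene and labeled true when the gene is an annotated target; the summary names the split strategies but not the exact protocol, so the pairing scheme, the toy identifiers, and the helpers `build_pairs` and `leave_out` are hypothetical.

```python
from itertools import product


def build_pairs(compounds, genes, known_targets):
    """Label each (compound, gene) pair True if the gene is an annotated target."""
    return [
        (compound, gene, gene in known_targets.get(compound, set()))
        for compound, gene in product(compounds, genes)
    ]


def leave_out(pairs, held_out, position):
    """Split pairs into (train, test) by holding out one entity.

    position=0 holds out a compound; position=1 holds out a gene.
    """
    train = [pair for pair in pairs if pair[position] != held_out]
    test = [pair for pair in pairs if pair[position] == held_out]
    return train, test


# Toy annotations (hypothetical identifiers, not taken from CPJUMP1).
compounds = ["cpd_1", "cpd_2", "cpd_3"]
genes = ["GENE_A", "GENE_B"]
known_targets = {
    "cpd_1": {"GENE_A"},
    "cpd_2": {"GENE_A"},
    "cpd_3": {"GENE_B"},
}

pairs = build_pairs(compounds, genes, known_targets)

# Leave-one-compound-out: no pair involving cpd_3 is seen during training.
train_c, test_c = leave_out(pairs, "cpd_3", position=0)

# Leave-one-gene-out: no pair involving GENE_B is seen during training.
train_g, test_g = leave_out(pairs, "GENE_B", position=1)

print(len(train_c), len(test_c))  # pairs kept for training vs. held out
print(len(train_g), len(test_g))
```

Holding out a gene is the harder setting: the model never sees any pair involving that gene during training, which corresponds to the "unknown genes" case where the summary reports a performance drop.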