A Categorical Compositional Distributional Modelling for the Language of Life

Yanying Wu,Quanlong Wang
DOI: https://doi.org/10.48550/arXiv.1902.09303
2019-08-13
Abstract:The Categorical Compositional Distributional (DisCoCat) Model is a powerful mathematical model for composing the meaning of sentences in natural languages. Since we can think of biological sequences as the "language of life", it is attempting to apply the DisCoCat model on the language of life to see if we can obtain new insights and a better understanding of the latter. In this work, we took an initial step towards that direction. In particular, we choose to focus on proteins as the linguistic features of protein are the most prominent as compared with other macromolecules such as DNA or RNA. Concretely, we treat each protein as a sentence and its constituent domains as words. The meaning of a word or the sentence is just its biological function, and the arrangement of domains in a protein corresponds to the syntax. Putting all those into the DisCoCat framework, we can "compute" the function of a protein based on the functions of its domains with the grammar rules that combine them together. Since the functions of both the protein and its domains are represented in vector spaces, we provide a novel way to formalize the functional representation of proteins.
Quantitative Methods
What problem does this paper attempt to address?