Linguistic networks uncover grammatical constraints of protein sentences comprised by domain-based words

Adrian A. Shimpi, Kristen M. Naegle
DOI: https://doi.org/10.1101/2024.12.04.626803
2024-12-04
Abstract: Evolution has developed a set of principles that determine feasible domain combinations analogous to grammar within natural languages. Treating domains as words and proteins as sentences, made up of words, we apply a linguistic approach to represent the human proteome as an n-gram network. Combining this with network theory and application, we explore the functional language and rules of the human proteome. Additionally, we explored subnetwork languages by focusing on reversible post-translational modification (PTM) systems that follow a reader-writer-eraser paradigm. We find that PTM systems appear to sample grammar rules near the onset of the system expansion, but then convergently evolve towards similar grammar rules, which stabilize during the post-metazoan switch. For example, reader and writer domains are typically tightly connected through shared n-grams, but eraser domains are almost always loosely or completely disconnected from readers and writers. Additionally, after grammar fixation, domains with verb-like properties, such as writers and erasers, never appear -- consistent with the idea of natural grammar that leads to clarity and limits futile enzymatic cycles. Then, given how some cancer fusion genes represent the possibility for the emergence of novel language, we investigate how cancer fusion genes alter the human proteome n-gram network. We find most cancer fusion genes follow existing grammar rules. Collectively, these results suggest that n-gram based analysis of proteomes is a complement to the more direct protein-protein interaction networks. N-grams can capture abstract functional connections in a more fully described manner, limited only by the definition of domains within the proteome and not by the combinatorial challenge of capturing all protein interaction connections.
Systems Biology
What problem does this paper attempt to address?
This paper applies methods from linguistics to several key problems in the study of protein structure and function:

1. **Defining the language rules of proteins**: The paper treats protein domains as "words" and proteins as "sentences" composed of those words. By constructing an n-gram network of the human proteome, the authors attempt to define the functional language and rules of the proteome. These rules are analogous to grammar in natural languages and determine which domain combinations are feasible.
2. **Exploring the impact of cancer gene fusions**: The authors studied how cancer fusion genes alter the n-gram network of the human proteome. Because cancer fusion genes may represent the emergence of novel language, understanding these alterations helps reveal how protein function changes in cancer. The authors find that most cancer fusion genes follow existing grammar rules.
3. **Analyzing post-translational modification systems**: The paper focuses on reversible post-translational modification (PTM) systems that follow the reader-writer-eraser paradigm. The authors found that PTM systems sample grammar rules near the onset of system expansion, then converge on similar grammar rules that stabilize during the post-metazoan switch. For example, reader and writer domains are typically tightly connected through shared n-grams, whereas eraser domains are almost always loosely connected to, or completely disconnected from, readers and writers.
4. **Evaluating n-gram models of different lengths**: The authors evaluated how n-gram models of different lengths affect the description of the human proteome. By calculating the information gain and relative entropy of each model, they found that the bigram model provides a large information gain but is not sufficient to accurately reproduce the diversity of domain n-grams. In contrast, an n-gram model containing up to 15 domains captures most of the diversity, but loses the information related to longer n-grams in about 5% of the proteome. Finally, the authors conclude that a 10-gram model is sufficient to capture and maximize the information encoded in protein domain architectures.
5. **Studying specific signaling subnetworks**: The authors also constructed n-gram networks of the phosphorylation systems to explore the characteristics that distinguish them. The results show that the pTyr system yields a fully connected graph, whereas the pSer/Thr system splits into multiple connected components, and the n-grams containing phosphatase, 14-3-3, or MH2 domains are disconnected from the rest of the pSer/Thr machinery. In addition, the authors found that most PTM systems rarely combine different modules in the same n-gram, and eraser domains in particular are rarely combined with other modules.

In conclusion, by constructing and analyzing the n-gram network of the proteome, this paper aims to reveal abstract relationships among protein functional connections and to provide a new perspective on the function and evolution of the proteome.
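To make the domains-as-words idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each protein is given as an ordered list of domain names, extracts contiguous domain n-grams up to a chosen length, and builds one possible network in which nodes are domains and an edge joins two domains that sit next to each other in some protein. That construction, the function names, and the three toy architectures are all illustrative assumptions; the paper's own network definition may differ. Connected components are then found with a plain depth-first search, which is the kind of analysis behind statements such as a system forming one connected graph or several disconnected pieces.

```python
def domain_ngrams(architecture, max_n):
    """Return every contiguous run of 1..max_n domains as a tuple."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(architecture) - n + 1):
            grams.append(tuple(architecture[i:i + n]))
    return grams


def build_adjacency_network(architectures):
    """Build an undirected graph as a dict of sets.

    Assumption for this sketch: nodes are domains, and an edge joins two
    different domains that are adjacent in at least one architecture.
    """
    graph = {}
    for arch in architectures.values():
        for domain in arch:
            graph.setdefault(domain, set())  # keep isolated domains as nodes
        for left, right in zip(arch, arch[1:]):
            if left != right:  # skip self-loops from repeated domains
                graph[left].add(right)
                graph[right].add(left)
    return graph


def connected_components(graph):
    """Return the connected components of the graph as a list of sets."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Toy, hypothetical architectures used only for illustration.
toy_architectures = {
    "protein_A": ["SH3", "SH2", "Kinase"],
    "protein_B": ["SH3", "SH2", "SH3"],
    "protein_C": ["Phosphatase"],
}

toy_graph = build_adjacency_network(toy_architectures)
toy_components = connected_components(toy_graph)

if __name__ == "__main__":
    print(domain_ngrams(toy_architectures["protein_A"], max_n=2))
    print(toy_components)
```

In this toy input the phosphatase domain never shares a protein with the other domains, so it ends up in its own component, which mirrors in miniature the disconnected eraser pattern described in the summary. For real proteome-scale data, a graph library would likely be a better fit than the hand-written search.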