Identification of Low-Complexity Domains by Compositional Signatures Reveals Class-Specific Frequencies and Functions across the Domains of Life

Sean M. Cascarina, Eric D. Ross
DOI: https://doi.org/10.1371/journal.pcbi.1011372
2024-05-16
PLoS Computational Biology
Abstract: Low-complexity domains (LCDs) in proteins are typically enriched in one or two predominant amino acids. As a result, LCDs often exhibit unusual structural/biophysical tendencies and can occupy functional niches. However, for each organism, protein sequences must be compatible with intracellular biomolecules and the physicochemical environment, both of which vary from organism to organism. This raises the possibility that LCDs may occupy sequence spaces in select organisms that are otherwise prohibited in most organisms. Here, we report a comprehensive survey and functional analysis of LCDs in all known reference proteomes (>21k organisms), with added focus on rare and unusual types of LCDs. LCDs were classified according to both the primary amino acid and secondary amino acid in each LCD sequence, facilitating detailed comparisons of LCD class frequencies across organisms. Examination of LCD classes at different depths (i.e., domain of life, organism, protein, and per-residue levels) reveals unique facets of LCD frequencies and functions. To our surprise, all 400 LCD classes occur in nature, although some are exceptionally rare. A number of rare classes can be defined for each domain of life, with many LCD classes appearing to be eukaryote-specific. Certain LCD classes were consistently associated with identical functions across many organisms, particularly in eukaryotes. Our analysis methods enable simultaneous, direct comparison of all LCD classes between individual organisms, resulting in a proteome-scale view of differences in LCD frequencies and functions. Together, these results highlight the remarkable diversity and functional specificity of LCDs across all known life forms.

Many protein sequences contain "low-complexity domains" (LCDs), which are regions in the sequence composed mostly of only one or two different types of amino acids. Since there are 20 main types of amino acids found in natural proteins, many "flavors" of LCDs are possible, with each flavor having unique structural and functional properties. However, the functions and prevalence of each type of LCD across organisms have not been explored extensively. In this study, we divided LCDs into 400 categories based on the one or two amino acids that were most common in each LCD sequence. Then, using a representative set of all known organisms, we examined how prevalent each type of LCD was and which functions were most often linked to each type of LCD. We uncovered LCD functions common to many organisms as well as LCD functions restricted to certain organisms or LCD types. Some organisms had unusually high levels of specific types of LCDs, suggesting that adaptations in these organisms or special conditions in their environment have aided in the tolerance, or even usage, of those LCD types. Our results give both broad and in-depth views of LCDs, their functions, and their frequencies in nature.
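The classification idea described above, labeling each LCD by its one or two most common amino acids, can be illustrated with a minimal Python sketch. This is not the authors' method: the function name `classify_lcd`, the `secondary_min_frac` cutoff of 0.2, and the alphabetical tie-breaking are all hypothetical choices made for illustration. The sketch also assumes that the 400 classes arise as 20 single-residue classes plus 380 ordered primary/secondary pairs, which is one reading of the summary rather than a stated definition.

```python
from collections import Counter
from itertools import permutations

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def classify_lcd(seq, secondary_min_frac=0.2):
    """Label an LCD sequence by its primary and (optionally) secondary residue.

    ASSUMPTION: secondary_min_frac is a hypothetical cutoff chosen for this
    sketch; it is not taken from the paper.
    """
    counts = Counter(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    if not counts:
        raise ValueError("sequence contains no standard amino acids")
    total = sum(counts.values())
    # Rank by descending count; break ties alphabetically so the
    # result is deterministic (an illustrative choice).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    primary = ranked[0][0]
    # Append a secondary residue only if it is sufficiently abundant.
    if len(ranked) > 1 and ranked[1][1] / total >= secondary_min_frac:
        return primary + ranked[1][0]
    return primary


# Every label the function can return: 20 single-residue classes
# plus 20 * 19 = 380 ordered pairs, for 400 in total.
ALL_CLASSES = list(AMINO_ACIDS) + [
    "".join(pair) for pair in permutations(AMINO_ACIDS, 2)
]

if __name__ == "__main__":
    print(classify_lcd("QQQQQQNNNQ"))  # Q-rich with N as secondary residue
    print(classify_lcd("QQQQQQQQQA"))  # Q-rich, no abundant secondary residue
    print(len(ALL_CLASSES))
```

Because the labels are ordered, a Q-rich domain with N as the secondary residue ("QN") is kept distinct from an N-rich domain with Q as the secondary residue ("NQ"), which is what allows class frequencies to be tallied and compared across proteomes.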
biochemical research methods, mathematical & computational biology