Proteome structuring of crown-of-thorns starfish
Yunchi Zhu, Zuhong Lu
DOI: https://doi.org/10.3389/fmars.2024.1487904
IF: 5.247
2024-10-09
Frontiers in Marine Science
Abstract: The crown-of-thorns starfish (COTS, Acanthaster planci) is a highly fecund predator of reef-building corals throughout the Indo-Pacific region (1). COTS population outbreaks cause significant damage to coral reefs, which provide the living environment for more than 30% of marine animals and plants (2), leading to a loss of coral cover and biodiversity. Scientists have sequenced the COTS genome (1), which provides a wealth of information on the genetic basis of COTS biology. By identifying specific genes and proteins involved in behaviors such as aggregation and reproduction, scientists can gain a deeper understanding of COTS reproductive strategies and the factors contributing to outbreaks, and so develop targeted biocontrol methods such as peptide mimetics that disrupt COTS aggregation. However, the functional annotation of the COTS proteome remains incomplete, with over 20% of proteins annotated as "uncharacterized". Traditional sequence-based annotation methods may be insufficient for fully resolving genomes, particularly for non-model organisms. It is commonly recognized that "sequence determines structure, and structure determines function." If proteome structuring, by which sequences are transformed into accurate structures in a high-throughput way, becomes as feasible as genome and transcriptome sequencing, such an approach could not only substantially aid researchers in complementing and correcting protein annotations, but also open a new dimension for protein data mining (3). This vision is being realized with the help of rapidly advancing artificial intelligence (AI) technology. AI-based protein structure prediction systems, represented by RoseTTAFold (4) and AlphaFold2 (5), have ushered in the high-throughput era of structural proteomics. With accuracy not inferior to traditional methods such as X-ray crystallography and cryo-electron microscopy (4), they have overwhelming advantages in cost, efficiency and ease of operation.
As of July 2022, the AlphaFold Protein Structure Database (AFDB) provides open access to over 200 million protein structures (6), a 1000-fold increase over the 50-year accumulation of the Protein Data Bank (PDB). Furthermore, owing to collaborative efforts within the open-source community, many optimized versions of AlphaFold2, collectively referred to as AlphaFold-like systems, have been developed, with ColabFold (7) standing out as a notable example. ColabFold significantly reduces the resource demands of protein folding, enabling more researchers to run their own structure predictions and expanding the overall data scale. To date, AlphaFold-like systems appear to be the most robust tools for high-throughput proteome structuring (8), facilitating a diverse array of research endeavors. Notable examples include the ColabFold proteome CP-8382 from Southeast University (2), the structural proteome of Sphagnum divinum from Oak Ridge National Laboratory (9), and the AlphaFold proteome of Mnemiopsis leidyi from the National Institutes of Health (10). These explorations are progressively expanding the protein structure universe and enabling new insights into protein function and biology.
Here we present the proteome structuring of COTS. Deploying ColabFold in the Big Data Computing Center at Southeast University, we predicted 31,743 protein structures. The resulting dataset covers 60.4% of residues with a confident prediction and 35.5% with very high confidence. We also performed a preliminary structural bioinformatics analysis using several post-AlphaFold methods, including fast structure clustering, ligand transplanting and structure-based Gene Ontology (GO) annotation.
Materials and methods
The NCBI RefSeq of Acanthaster planci (GCF_001949145.1) was used as the sequence source. Protein sequences were downloaded and filtered, discarding those exceeding 2,550 aa to stay within the upper limit of GPU memory.
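The length filter described above can be illustrated with a minimal sketch. It assumes the downloaded proteome is a plain FASTA file; the helper names `read_fasta` and `filter_by_length` are hypothetical and are not taken from the authors' pipeline. Only the 2,550 aa cutoff comes from the text.

```python
from typing import Iterable, Iterator, Tuple

MAX_LEN = 2550  # cutoff stated in the text (amino acids)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Tuple[str, str]], max_len: int = MAX_LEN
) -> Iterator[Tuple[str, str]]:
    """Keep records whose sequence does not exceed max_len residues."""
    for header, seq in records:
        if len(seq) <= max_len:
            yield header, seq
```

Because `read_fasta` accepts any iterable of lines, an open file handle can be passed directly, for example `filter_by_length(read_fasta(open(path)))` with a hypothetical `path`.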
Multiple sequence alignment (MSA) generation was performed locally (colabfold_search), and the MSAs in A3M format were then uploaded to ColabFold 1.5.2 on the NVIDIA Tesla V100 cluster at the Big Data Computing Center of Southeast University. The parameters of ColabFold were set to --amber, --num-recycle 3, --use-gpu-relax, --zip, --num-relax 1.
During the structure prediction process, the MineProt (11) toolkit (colabfold/import.sh --name-mode 1 --zip --relax) was periodically executed to process predicted proteins. This included selecting the best structure model, i.e. the one with the highest predicted local distance difference test (pLDDT) score, generating CIF files, and storing model scores in JSON format.
Foldseek (12) was employed for high-throughput structure alignment and clustering. Predicted structures were aligned to the AlphaFold Clusters (13) using easy-search -e 0.01 -s 7.5, and were clustered using easy-cluster -c 0.9 -e 0.01 --min-seq-id 0.5. Uncharacterized proteins clustered with annotated COTS proteins were selected, then t -Abstract Truncated-
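The pLDDT-based model selection and the residue-level confidence coverage mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the MineProt implementation: it assumes each model's per-residue pLDDT scores are already loaded as a list of numbers, ranks models by mean pLDDT, and uses the conventional AlphaFold bands (above 70 for "confident", above 90 for "very high"), which the text does not spell out. The function names are hypothetical.

```python
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def best_model(models: Dict[str, List[float]]) -> str:
    """Return the name of the model with the highest mean pLDDT.

    Assumption: models maps a model name to its per-residue pLDDT list.
    """
    return max(models, key=lambda name: mean(models[name]))


def confidence_coverage(
    plddt_lists: Iterable[List[float]],
    confident: float = 70.0,   # assumed "confident" threshold
    very_high: float = 90.0,   # assumed "very high" threshold
) -> Tuple[float, float]:
    """Fraction of all residues above each pLDDT threshold.

    The "confident" fraction is cumulative: it includes very-high residues.
    """
    total = n_confident = n_very_high = 0
    for plddt in plddt_lists:
        for score in plddt:
            total += 1
            n_confident += score > confident
            n_very_high += score > very_high
    if total == 0:
        return 0.0, 0.0
    return n_confident / total, n_very_high / total
```

Applied to the best model of every protein, `confidence_coverage` would give dataset-level fractions of the kind reported in the abstract, provided the same thresholds were used.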
marine & freshwater biology