Abstract: The crown-of-thorns starfish (COTS, Acanthaster planci) is a highly fecund predator of reef-building corals throughout the Indo-Pacific region (1). COTS population outbreaks cause significant damage to coral reefs, which provide habitat for more than 30% of marine animals and plants (2), leading to a loss of coral cover and biodiversity. The COTS genome has been sequenced (1), providing a wealth of information on the genetic basis of COTS biology. By identifying the specific genes and proteins involved in reproduction and aggregation, scientists can gain a deeper understanding of COTS reproductive strategies and the factors contributing to outbreaks, and can develop targeted biocontrol methods such as peptide mimetics that disrupt COTS aggregation. However, the functional annotation of the COTS proteome remains incomplete, with over 20% of proteins annotated as "uncharacterized". Traditional sequence-based annotation methods may be insufficient for fully resolving genomes, particularly for non-model organisms.

It is commonly recognized that "sequence determines structure, and structure determines function." If proteome structuring, in which sequences are transformed into accurate structures in a high-throughput way, were as feasible as genome and transcriptome sequencing, it could not only substantially aid researchers in complementing and correcting protein annotations, but also open a new dimension for protein data mining (3). Artificial intelligence (AI) is bringing this vision within reach. AI-based protein structure prediction systems, represented by RoseTTAFold (4) and AlphaFold2 (5), have ushered in the high-throughput era of structural proteomics. With accuracy that in many cases approaches that of experimental methods such as X-ray crystallography and cryo-electron microscopy (4), they offer large advantages in cost, efficiency and ease of operation.
As of July 2022, the AlphaFold Protein Structure Database (AFDB) provides open access to over 200 million protein structures (6), roughly a 1000-fold increase over the 50-year accumulation of the PDB. Furthermore, owing to collaborative efforts within the open-source community, many optimized versions of AlphaFold2, collectively referred to as AlphaFold-like systems, have been developed, with ColabFold (7) a notable example. ColabFold significantly reduces the resource demands of protein folding, enabling more researchers to run their own structure predictions and expanding the overall scale of available data. To date, AlphaFold-like systems are the most robust tools for high-throughput proteome structuring (8), and they support a diverse array of research. Notable examples include the ColabFold proteome CP-8382 from Southeast University (2), the structural proteome of Sphagnum divinum from Oak Ridge National Laboratory (9), and the AlphaFold proteome of Mnemiopsis leidyi from the National Institutes of Health (10). These explorations are progressively expanding the protein structure universe and enabling new insights into protein function and biology.

Here we present the proteome structuring of COTS. Deploying ColabFold in the Big Data Computing Center at Southeast University, we predicted 31,743 protein structures. In the resulting dataset, 60.4% of residues have a confident prediction and 35.5% have a very high confidence prediction. We also performed a preliminary structural bioinformatics analysis using several post-AlphaFold methods, including fast structure clustering, ligand transplanting and structure-based Gene Ontology (GO) annotation.

Materials and methods

The NCBI RefSeq assembly of Acanthaster planci (GCF_001949145.1) was used as the sequence source. Protein sequences were downloaded and filtered, discarding those exceeding 2,550 aa to stay within the upper limit of GPU memory.
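The length filter described above can be sketched as follows. This is a minimal illustration, not the authors' script: only the 2,550 aa cutoff comes from the text, while the function names and the in-memory demo records are assumptions made for the example. "Exceeding 2,550 aa" is read here as strictly longer than 2,550 residues.

```python
from typing import Iterable, Iterator, List, Tuple

MAX_LEN = 2550  # cutoff stated in the text; everything else here is illustrative


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Tuple[str, str]],
                     max_len: int = MAX_LEN) -> Iterator[Tuple[str, str]]:
    """Keep only records whose sequence length does not exceed max_len."""
    for header, seq in records:
        if len(seq) <= max_len:
            yield header, seq


def write_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Serialize records back to FASTA text, wrapping sequences at `width`."""
    out: List[str] = []
    for header, seq in records:
        out.append(f">{header}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)


if __name__ == "__main__":
    # Hypothetical records: one short protein and one above the cutoff.
    demo = [">short_protein", "MKT", ">long_protein", "A" * 3000]
    kept = list(filter_by_length(parse_fasta(demo)))
    print(write_fasta(kept), end="")
```

In practice the same functions could be applied to an open file handle for the downloaded RefSeq protein FASTA, since a file object iterates line by line.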
Multiple sequence alignment (MSA) generation was conducted locally (colabfold_search), and the MSAs in A3M format were then uploaded to ColabFold 1.5.2 on the NVIDIA Tesla V100 cluster at the Big Data Computing Center of Southeast University. ColabFold was run with the parameters --amber, --num-recycle 3, --use-gpu-relax, --zip and --num-relax 1.

During structure prediction, the MineProt (11) toolkit (colabfold/import.sh --name-mode 1 --zip --relax) was executed periodically to process the predicted proteins. This included selecting the best structure model for each protein, i.e. the one with the highest predicted local distance difference test (pLDDT) score, generating CIF files, and storing model scores in JSON format.

Foldseek (12) was employed for high-throughput structure alignment and clustering. Predicted structures were aligned to the AlphaFold Clusters (13) using easy-search -e 0.01 -s 7.5, and were clustered using easy-cluster -c 0.9 -e 0.01 --min-seq-id 0.5. Uncharacterized proteins clustered with annotated COTS proteins were selected, then t -Abstract Truncated-
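The selection of uncharacterized proteins that share a cluster with annotated ones, as described in the clustering step above, might look like the sketch below. It assumes the two-column cluster table written by Foldseek easy-cluster (representative, then member, tab-separated, with each representative also listed as a member of its own cluster). The protein identifiers, the descriptions, and the rule that a protein counts as uncharacterized when its description contains that word are all hypothetical choices for illustration, not details from the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def read_cluster_rows(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Parse tab-separated rows of the form: representative, member."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        rep, member = line.split("\t")[:2]
        yield rep, member


def is_uncharacterized(description: str) -> bool:
    """Assumed rule: the annotation text contains the word 'uncharacterized'."""
    return "uncharacterized" in description.lower()


def uncharacterized_with_annotated_neighbors(
    rows: Iterable[Tuple[str, str]],
    descriptions: Dict[str, str],
) -> Dict[str, List[str]]:
    """Map each uncharacterized protein to the annotated members of its cluster.

    Proteins without an entry in `descriptions` are ignored on both sides.
    """
    clusters: Dict[str, List[str]] = defaultdict(list)
    for rep, member in rows:
        clusters[rep].append(member)

    result: Dict[str, List[str]] = {}
    for members in clusters.values():
        annotated = [
            m for m in members
            if m in descriptions and not is_uncharacterized(descriptions[m])
        ]
        if not annotated:
            continue  # nothing in this cluster to transfer an annotation from
        for m in members:
            if is_uncharacterized(descriptions.get(m, "")):
                result[m] = annotated
    return result


if __name__ == "__main__":
    # Hypothetical cluster table and annotations.
    table = ["P1\tP1", "P1\tP2", "P3\tP3", "P3\tP4"]
    desc = {
        "P1": "collagen alpha-1 chain",
        "P2": "uncharacterized protein LOC1",
        "P3": "uncharacterized protein LOC2",
        "P4": "uncharacterized protein LOC3",
    }
    hits = uncharacterized_with_annotated_neighbors(read_cluster_rows(table), desc)
    for protein, neighbors in hits.items():
        print(protein, "->", ", ".join(neighbors))
```

Returning the annotated neighbors alongside each hit keeps the evidence for any later annotation transfer visible, rather than reducing the result to a bare list of identifiers.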

Protein Target Selection for Structural Genomics of Thermoanaerobacter Tengcongensis

An analysis of the proteomic profile for Thermoanaerobacter tengcongensis under optimal culture conditions

Quantitative Proteomics Reveals the Temperature-Dependent Proteins Encoded by a Series of Cluster Genes in Thermoanaerobacter Tengcongensis

The proteomic studies on Thermoanaerobacter tengcongensis

The proteomic alterations of Thermoanaerobacter tengcongensis cultured at different temperatures.

Expression, Purification, Crystallization and Preliminary Crystallographic Study of a Potential Metal-Dependent Hydrolase with Cyclase Activity From Thermoanaerobacter Tengcongensis

A Computational Study of Shewanella Oneidensis MR-1: Structural Prediction and Functional Inference of Hypothetical Proteins.

Proteomic analysis of membrane proteins from a radioresistant and moderate thermophilic bacterium Deinococcus geothermalis.

qProtein: Exploring Physical Features of Protein Thermostability Based on Structural Proteomics

Parallel cloning, expression, purification and crystallization of human proteins for structural genomics.

Developments in Structural Genomics: Protein Purification and Function Interpretation

Improving Solubility of Shewanella Oneidensis MR-1 and Clostridium Thermocellum JW-20 Proteins Expressed into Esherichia Coli.

TIMomics: Genome-Wide Search for Evolutionary Relationships among TIM (Triose-Phosphate Isomerase) Fold Proteins via Structural Genomics Approaches

A Large-Scale, High-Efficiency and Low-Cost Platform for Structural Genomics Studies.

Survey of Acetylation for Thermoanaerobacter tengcongensis

A Targetron System for Gene Targeting in Thermophiles and Its Application in Clostridium Thermocellum

Designing of thermostable proteins with a desired melting temperature

Backbone solution structures of proteins using residual dipolar couplings: application to a novel structural genomics target

A High-Efficiency, Low-Cost Platform for Structural Genomics Studies at Peking University

Proteome structuring of crown-of-thorns starfish

Identification and characterization of proteins of unknown function (PUFs) in Clostridium thermocellum DSM 1313 strains as potential genetic engineering targets