In Silico Cloning of C17orf32, a Novel Human Gene and Verification of Its Coding Region by RT-PCR

DL Zhang, PG Ding, LJ Ling, RS Chen, DL Ma
DOI: https://doi.org/10.3321/j.issn:1000-3282.2002.04.011
2002-01-01
PROGRESS IN BIOCHEMISTRY AND BIOPHYSICS
Abstract: A novel human gene encoding a protein of 208 amino acids has been identified and characterized. HGNC has assigned it the symbol C17orf32 and the name chromosome 17 open reading frame 32. The full-length cDNA of 1,679 bp for C17orf32 was cloned through a BLAST search of public databases, following the identification of a 1,119 bp cDNA obtained by EST assembly, fully automated with the SiClone software (created by Chen RS and Ling LJ, to be released on their website) on a ShenWei IV-type supercomputer. Structurally, C17orf32 has one calcitonin/CGRP/IAPP family signature from amino acid 16 to 169, one dihydroorotase signature from amino acid 43 to 117, one tyrosine kinase phosphorylation site from amino acid 68 to 75, and one bipartite nuclear localization signal from amino acid 28 to 45. These motifs imply the potential biological importance of this gene. Genomic organization analyses show that the C17orf32 gene comprises six exons, ranging in size from 43 to 1,101 bp, and five introns, ranging in size from 163 to 1,124 bp, and spans 4.61 kb. All of the exon/intron boundaries are consistent with the GT/AG rule, and consensus sequences surrounding the splice boundaries are found as well. The C17orf32 gene is located on contig NT_010808.7 of human chromosome 17 and is linked only with LOC124919, a hypothetical human gene with an 889 bp mRNA (XM_058865) encoding the hypothetical protein XP_058865 of 260 amino acids. The sequence of LOC124919 has not been verified experimentally. Furthermore, the full-length ORF, a 627 bp cDNA from 31 to 654 bp, was cloned by RT-PCR from single-stranded cDNA of the human gastric adenocarcinoma MGC803 cell line and sequenced; its nucleotide sequence is fully identical to that obtained by in silico cloning. Thus, the in silico clone of the C17orf32 gene, with GenBank accession numbers AY074907 and TPA: BK000260, was identified solely by bioinformatics analyses.
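As an illustration of the GT/AG check mentioned above, the following minimal sketch tests whether each intron lying between consecutive exons begins with GT and ends with AG. The function names, the coordinate convention (0-based, end-exclusive), and the toy sequence are assumptions made for this example; they are not taken from SiClone or from the C17orf32 data.

```python
def follows_gt_ag_rule(intron: str) -> bool:
    """Return True if an intron sequence starts with GT and ends with AG."""
    seq = intron.upper()
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


def check_introns(genomic: str, exons: list[tuple[int, int]]) -> list[bool]:
    """Check every intron lying between consecutive exons.

    `exons` holds 0-based, end-exclusive (start, end) coordinates on `genomic`,
    sorted in transcript order on the forward strand.
    """
    results = []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        results.append(follows_gt_ag_rule(genomic[prev_end:next_start]))
    return results


# Toy sequence (hypothetical, not C17orf32): two exons separated by one intron.
toy_genomic = "ATGGCC" + "GTAAGTTTTCAG" + "GATTAA"
toy_exons = [(0, 6), (18, 24)]
print(check_introns(toy_genomic, toy_exons))  # [True]
```

For a six-exon gene such as the one described here, the same call would return five booleans, one per intron.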
The full-length cDNA sequence of 1,679 bp exhibits very good overall homology to that of LOC123722, an 899 bp mRNA, with a matching percentage of 99% in 78% of the total window at the nucleotide level and 57% in 57% of the total window at the protein level. However, the base G at position 401 of the LOC123722 cDNA is a redundant insert, which causes a reading frame shift and the translation of an alternative protein. The inserted G of LOC123722 is not supported by the experimental clone, is fully rejected by human EST alignment, and is shown to be redundant by genomic GT/AG organization analysis. The C17orf32 gene has 9 putative promoters with probabilities of 58% to 97%, two TATA boxes, a stop codon upstream of the ORF, two poly(A) signals and a poly(A) tail downstream of the ORF, and conforms to the Kozak rule around the translation start of the ORF. Based on the above results, it can be concluded that a complete novel human gene has been obtained. The full-length gene sequence exhibits little overall homology to any known protein at either the nucleotide or the amino acid level. The two related proteins, with 31% (in 29% of total window) and 18% (in 18% of total window) identity over the full-length protein, respectively, are the hypothetical Caenorhabditis elegans protein F09E5.11.p of 221 amino acids and polyphosphate kinase [the filamentous nitrogen-fixing cyanobacterium Anabaena sp. strain PCC 7120] of 736 amino acids. Taken together, by combining bioinformatics analyses with experimental verification, the novel human gene C17orf32 has been successfully cloned and verified by a series of theoretical and experimental evidence. The strategy will be helpful in discovering more novel human genes, and even in correcting errors that appear in NCBI Genome Annotation Project RefSeqs, such as LOC124919, a model reference sequence predicted from NCBI contig NT_010808 by automated computational analysis using a gene prediction method.
Therefore, human genome coding regions annotated by computer should be used with caution.
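The effect of a single redundant base on the reading frame, as described above for the inserted G, can be illustrated with a minimal sketch. It translates a toy coding sequence with the standard genetic code before and after inserting one G. The sequences, the insert position, and the helper name `translate` are hypothetical and chosen only for illustration; they are not the C17orf32 or LOC123722 sequences.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cdna: str) -> str:
    """Translate from the first base in frame 0 until a stop codon or the end."""
    protein = []
    for i in range(0, len(cdna) - 2, 3):
        amino_acid = CODON_TABLE[cdna[i:i + 3].upper()]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy ORF (hypothetical): ATG GCT GAA ACC TGA
reference = "ATGGCTGAAACCTGA"
# Insert one redundant G after the second codon; every later codon shifts.
with_insert = reference[:6] + "G" + reference[6:]

print(translate(reference))    # MAET
print(translate(with_insert))  # MAGNL
```

Everything downstream of the inserted base is read in a different frame, so the predicted protein diverges from that point on and the original stop codon is no longer recognized.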