[Correction of Five Different Types of Errors of Model REFSEQs Appeared in NCBI Human Gene Database Only by Using Two Novel Human Genes C17orf32 and ZNF362].
De-Li Zhang,Yan-Da Li,Liang Ji
IF: 5.723
2004-01-01
Journal of Genetics and Genomics
Abstract: We found many errors in the REFSEQs issued by the NCBI genome annotation project, which indicates that the NCBI REFSEQ database should be used with caution. Using a technical route that combines bioinformatics analysis with experimental verification, and comparing cloned genes against the non-redundant database, we found many errors in the computer-annotated human genome coding sequences released on the internet. We first cite nine types of errors in novel human genes predicted by the NCBI Genome Annotation Project. Here we give one example in detail. (1) Comparison of the sequences of the novel human gene C17orf32 and the hypothetical human gene LOC124919. LOC123722 is a modified sequence of the C17orf32 cDNA with a G inserted between nucleotides 406 and 407. The base G at position 401 of the LOC123722 cDNA is a redundant insertion that causes a reading-frame shift, so that an alternative protein is translated. This inserted G was not found in our experimental clone, is fully rejected by human EST alignment, and is shown to be redundant by analysis of the genomic GT/AG organization. (2) Comparison of the sequences of the novel human gene C17orf32 and the hypothetical human gene LOC147007. The C17orf32 gene (ORF from nucleotides 31 to 657) is located on human chromosome 17 (Accession No. NT_010808.7) and is at present linked only with the hypothetical human gene LOC147007 (ORF from nucleotides 55 to 435). This hypothetical gene sequence has not been verified experimentally and is an erroneous form of our verified C17orf32 gene. The full-length 1,679 bp cDNA sequence of C17orf32 shows overall homology to the 625 bp mRNA of LOC147007, with 37% matching in 36% of the total window over the full-length nucleotide sequence; in particular, nucleotides 121-366 of LOC147007 are identical to nucleotides 316-561 of C17orf32.
Accordingly, the 126 aa protein XP_097165 of LOC147007 shows overall homology to the 208 aa protein encoded by C17orf32, with 50% matching in 48% of the total window over the full-length protein; in particular, residues 23-104 of XP_097165 are identical to residues 96-177 of the C17orf32 protein. Both flanking regions of LOC147007 outside this shared central part of the ORF are erroneous assemblies of unrelated cDNA. In addition, we cloned in silico a novel mouse gene, ORF32 (open reading frame 32), with TPA accession number BK000258, which is the mouse ortholog of human C17orf32. Our strategy helps both to find more novel human genes and to correct errors in the REFSEQs issued by the NCBI genome annotation project. For example, gene prediction by automatic calculation and analysis produced two model reference sequences (LOC124919 and LOC147007) from NCBI contig NT_010808. Both should be C17orf32, but in fact they are different erroneous forms of C17orf32, representing the first and second types of error, respectively. As another example, gene prediction by automatic calculation and analysis produced three model reference sequences (LOC14907, LOC200084 and LOC91126) from NCBI contig NT_004511; all three are in fact the single gene ZNF362, but they were submitted as three different erroneous forms of ZNF362, representing the fourth, fifth, and seventh types of error, respectively. Currently erroneous human genome coding sequences can be corrected or avoided by combining in silico cloning with experimental verification. Computer annotation, which may contain all types of erroneous human genome coding sequences, should be treated with caution. The correct identification and annotation of novel human genes remains a long and arduous task.
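The coordinates quoted in the abstract can be cross-checked with simple arithmetic. The sketch below is illustrative only and is not the authors' method. It assumes that the ORF coordinates are 1-based and inclusive and that each quoted ORF range includes the stop codon; the abstract does not state this, but it is the reading under which the quoted protein lengths come out right. The function names `orf_length_aa` and `nt_to_codon` are hypothetical helpers introduced here.

```python
def orf_length_aa(start, end):
    """Protein length implied by a 1-based, inclusive ORF range.

    Assumption (not stated in the abstract): the range includes the stop codon.
    """
    n_nt = end - start + 1
    if n_nt % 3 != 0:
        raise ValueError("ORF length is not a multiple of 3")
    return n_nt // 3 - 1  # drop the stop codon


def nt_to_codon(pos, orf_start):
    """1-based index of the codon that contains nucleotide `pos`."""
    if pos < orf_start:
        raise ValueError("position lies upstream of the ORF")
    return (pos - orf_start) // 3 + 1


# Values quoted in the abstract.
C17ORF32_ORF = (31, 657)
LOC147007_ORF = (55, 435)
C17ORF32_SHARED_NT = (316, 561)
LOC147007_SHARED_NT = (121, 366)

# Protein lengths implied by the ORF ranges.
print(orf_length_aa(*C17ORF32_ORF))    # 208
print(orf_length_aa(*LOC147007_ORF))   # 126

# Map the identical nucleotide segments onto residue numbers.
c17_residues = tuple(nt_to_codon(p, C17ORF32_ORF[0]) for p in C17ORF32_SHARED_NT)
loc_residues = tuple(nt_to_codon(p, LOC147007_ORF[0]) for p in LOC147007_SHARED_NT)
print(c17_residues)  # (96, 177)
print(loc_residues)  # (23, 104)
```

Under these assumptions the nucleotide and protein figures agree: both shared segments are 246 nt long (82 codons), and they map to residues 96-177 of the C17orf32 protein and residues 23-104 of XP_097165, as the abstract reports.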