Abstract: By homology BLAST searches of our cloned genes against the non-redundant (nr) database, we found that computer-annotated coding regions of the human genome in the public domain contain many errors of different kinds, including insertions, deletions, or mutations of a single base pair or of a whole segment at the cDNA level, as well as various combinations of these errors. We use three main approaches to validate model genes in the NCBI GENOME ANNOTATION PROJECT REFSEQs and to identify their errors: (1) Evaluating the degree of support from human EST clustering and from BLAST against the draft human genome. (2) Mapping our verified genes to chromosomes and analyzing their genomic organization. All exon/intron boundaries should conform to the GT/AG rule, and the consensus sequences surrounding the splice boundaries should also be present. (3) Experimental verification of the in silico cloned genes by RT-PCR, followed by cDNA sequencing. We use three further approaches as supporting reference: (1) Web searching for, or in silico cloning of, the corresponding genes in other species, especially the mouse and rat homologs, and thus judging from orthology whether the gene exists. (2) Taking as the standard the released genes in the public domain that should be highly homologous to our verified genes, especially the released human genes in the NCBI GENOME ANNOTATION PROJECT REFSEQs, we try to clone a complete gene matching each released gene, following the strategy developed in this paper. If we cannot obtain such a clone, our verified gene may be correct and the released gene may be wrong. (3) To find more evidence, we verify our cloned genes by RT-PCR or hybridization techniques. Here we list some of the errors we found in the NCBI GENOME ANNOTATION PROJECT REFSEQs: (1) A base is mistakenly inserted in the ORF, which shifts the reading frame of the encoded amino acids.
In detail, a redundant base inserted in the ORF of a gene shifts the reading frame, so that a different protein is translated; for example, LOC124919 is an erroneous form of C17orf32 (whose mouse and rat orthologs we have determined). (2) Unrelated segments are forcibly joined by mistake. This is a wrong assembly of unrelated cDNA segments; for example, LOC147007 is an erroneous form of C17orf32. (3) A base or a section of cDNA is mistakenly inserted in the ORF, causing it to terminate prematurely, so that the incomplete cDNA sequence encodes only the N-terminal amino acids. For example, LOC123722 is an erroneous form of SPRYD1, and even the human hypothetical gene LOC126250, or PDCD5, is an erroneous form of our PDCD5 (TFAR19). (4) The sequence is incomplete and encodes only the C-terminal amino acids. For example, human LOC149076 and mouse LOC230761 are erroneous forms of our verified human ZNF362 and mouse Zfp362, respectively. (5) The sequence is incomplete and encodes only an internal section of the correct gene's ORF, lacking both the N-terminal and C-terminal amino acid sequences; at the same time, the first amino acid of the incomplete protein, which is not encoded by an initiation codon, is mistakenly predicted as the initiator, e.g. L is predicted as M. For example, LOC200084 is an erroneous form of ZNF362. (6) A base or a section of cDNA is mistakenly inserted in the ORF, wrongly creating an unwanted termination codon before the insertion, so that the encoded protein lacks the first part of its amino acids. For example, GenBank Acc. No. AL096883 (LOCUS No. HS323M22B) is an erroneous form of the experimentally verified human NM_012263, whose mouse ortholog BC010510 has been determined. (7) Contaminating genomic sequence may be taken as a complete gene cDNA sequence, and a so-called single-exon gene predicted from it; even where the gene is real, there is only a small ORF in a very long single-exon mRNA, while a termination codon actually exists in the same frame upstream of the ORF's initiation codon, and no other features satisfy the criteria for a gene.
For example, LOC91126 is an erroneous form of ZNF362. (8) The predicted gene consists only of an ORF, with no EST evidence at either end. Starting from this ORF, a complete gene cDNA can be obtained that is supported by both ESTs and the human genome (with termination codons in the same frame upstream of the ORF), which indicates that the predicted ORF reference sequence may be incorrect. For example, LOC164395 may be an erroneous form of the novel human gene bankit4590055. (9) A similar but smaller protein-coding gene is predicted within a region of the human genome sequence that is supported by experimental EST evidence, so the other newly predicted gene may be incorrect. For example, LOC167563 may be an erroneous form of CMYA5. However, these errors can be corrected or avoided by using our strategy. Here we give one example in detail: comparison of the SPRYD1 sequence with the human hypothetical gene LOC123722. The TAA bases at positions 478-480 of the LOC123722 cDNA are redundant, which shifts the reading frame so that a different protein is translated. The redundant GTAAA of LOC123722 is not supported by our experimental clone, is almost completely rejected by the human EST alignment, and is shown by genomic GT/AG organization analysis to belong to the following intron. Verification of the cDNA or genomic DNA sequence of SPRYD1 implies that, because of the prediction program, LOC123722 has an erroneous stop codon within its ORF and is therefore not a complete CDS.
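The effect of a redundant insertion on the reading frame can be illustrated with a minimal sketch. The toy ORF and the helper names `translate` and `insert_bases` below are hypothetical and are not taken from SPRYD1, LOC123722, or any real record; the sketch only assumes the standard genetic code.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cdna: str) -> str:
    """Translate from the first base and stop at the first termination codon."""
    protein = []
    for i in range(0, len(cdna) - 2, 3):
        amino_acid = CODON_TABLE[cdna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def insert_bases(cdna: str, position: int, bases: str) -> str:
    """Return cdna with `bases` inserted before 0-based index `position`."""
    return cdna[:position] + bases + cdna[position:]


# Hypothetical toy ORF: ATG GCT GAA CTG AAA GGT TAA
correct_orf = "ATGGCTGAACTGAAAGGTTAA"

# One redundant base shifts the reading frame downstream of the insertion.
frameshifted = insert_bases(correct_orf, 6, "T")

# A redundant in-frame TAA ends the ORF early, leaving only the N-terminal part.
truncated = insert_bases(correct_orf, 9, "TAA")

print(translate(correct_orf))   # full-length toy protein
print(translate(frameshifted))  # product after the frame shift
print(translate(truncated))     # N-terminal fragment only
```

Comparing the translation of a predicted model sequence with that of an experimentally verified cDNA in this way shows directly whether an extra base or segment changes or truncates the encoded protein.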
To sum up, by combining bioinformatic analyses with experimental verification, and through BLAST of our cloned genes against the non-redundant database, we have found many errors of at least nine kinds in the NCBI GENOME ANNOTATION PROJECT REFSEQs, and our strategy is helpful in correcting them. Examples are LOC149076, LOC200084 and LOC91126 (all of which should be ZNF362 but are three different kinds of erroneous forms of it), three model reference sequences predicted from NCBI contig NT_004511 by automated computational analysis using a gene prediction method, and LOC124919 and LOC147007 (both of which should be C17orf32 but are two different kinds of erroneous forms of it), two model reference sequences predicted from NCBI contig NT_010808 by automated computational analysis using a gene prediction method. Therefore, the correct identification and annotation of novel human genes may still be a heavy task that can be completed only over a long period of time, and computer-annotated coding regions of the human genome should be used with caution. Previously published articles did not clearly point out the existence of errors in the NCBI human gene model reference sequences. We first presented our results on this subject as a poster at the Seventh International Human Genome Conference, held in April 2002.
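One of the bioinformatic checks used in this strategy, the GT/AG rule for exon/intron boundaries, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `check_gt_ag`, the 0-based half-open exon coordinates, and the toy genomic sequence are all hypothetical.

```python
def check_gt_ag(genomic: str, exons: list[tuple[int, int]]) -> list[tuple[int, int, bool]]:
    """Check every intron implied by a list of exons against the GT/AG rule.

    `exons` holds sorted, 0-based, half-open (start, end) intervals on `genomic`.
    Returns one (intron_start, intron_end, ok) tuple per pair of adjacent exons,
    where ok is True if the intron starts with GT and ends with AG.
    """
    results = []
    for (_, exon_end), (next_start, _) in zip(exons, exons[1:]):
        intron = genomic[exon_end:next_start].upper()
        ok = len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")
        results.append((exon_end, next_start, ok))
    return results


# Hypothetical toy locus: exon 1, a GT...AG intron, exon 2.
genomic = "ATGGCTGAA" + "GTAAGTCTTTCAG" + "CTGAAAGGTTAA"

# Exon model with the boundary placed correctly.
verified_model = [(0, 9), (22, 34)]

# Exon model whose first exon wrongly runs five bases into the intron.
predicted_model = [(0, 14), (22, 34)]

print(check_gt_ag(genomic, verified_model))
print(check_gt_ag(genomic, predicted_model))
```

A model whose exon runs into intronic sequence fails the check at that junction, which is the kind of evidence that, together with EST alignment and experimental clones, can flag a predicted reference sequence for closer inspection.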
(ABSTRACT TRUNCATED)

Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies

Genome annotation errors in pathway databases due to semantic ambiguity in partial EC numbers

Identification and correction of abnormal, incomplete and mispredicted proteins in public databases

Estimating the annotation error rate of curated GO database sequence annotations

Biases in the Experimental Annotations of Protein Function and Their Effect on Our Understanding of Protein Function Space

MisPred: a resource for identification of erroneous protein sequences in public databases

Gene annotation errors are common in the mammalian mitochondrial genomes database

A strategy for large-scale comparison of evolutionary- and reaction-based classifications of enzyme function

High precision multi-genome scale reannotation of enzyme function by EFICAz

Overcoming the widespread flaws in the annotation of vertebrate selenoprotein genes in public databases

CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats

Semi-Automatic Detection of Errors in Genome-Scale Metabolic Models

A Longitudinal Analysis of Function Annotations of the Human Proteome Reveals Consistently High Biases

Improving enzyme functional annotation by integrating in vitro and in silico approaches: The example of histidinol phosphate phosphatases

Automated validation of genetic variants from large databases: ensuring that variant references refer to the same genomic locations

Taxonomy annotation and guide tree errors in 16S rRNA databases

Metadata in the BioSample Online Repository are Impaired by Numerous Anomalies

[Analysis, Identification and Correction of Some Errors of Model Refseqs Appeared in NCBI Human Gene Database by in Silico Cloning and Experimental Verification of Novel Human Genes].

Interactive Tools for Functional Annotation of Bacterial Genomes

Using deep-learning predictions reveals a large number of register errors in PDB depositions

Flawed machine-learning confounds coding sequence annotation