Abstract: We found many errors in the REFSEQs issued by the NCBI genome annotation project, which indicates that users should be cautious when using the NCBI REFSEQ database. By combining bioinformatics analysis with experimental verification, and by comparing cloned genes against the non-redundant database, we found many errors in the computer-annotated human genome coding sequences released on the internet. We first list nine types of errors in novel human genes predicted by the NCBI Genome Annotation Project. Here we give one example in detail: (1) Comparison of the sequences of the novel human gene C17orf32 and the hypothetical human gene LOC124919. LOC123722 is a modified sequence of the C17orf32 cDNA with a G inserted between nucleotides 406 and 407. The base G at position 401 of the LOC123722 cDNA is a redundant insertion, which causes a reading frame shift and the translation of an alternative protein. This inserted G was not found in our experimental clone, is fully rejected by human EST alignment, and is shown to be redundant by analysis of the genomic GT/AG organization. (2) Comparison of the sequences of the novel human gene C17orf32 and the hypothetical human gene LOC147007. The C17orf32 gene (ORF from nucleotides 31 to 657) is located on human chromosome 17 (Accession No. NT_010808.7) and is currently linked only to the hypothetical human gene LOC147007 (ORF from nucleotides 55 to 435). This hypothetical human gene sequence has not been verified by experiment and is an erroneous form of our verified C17orf32 gene. The full-length 1,679 bp cDNA sequence of C17orf32 exhibits overall homology to the 625 bp mRNA of LOC147007, with 37% matching in 36% of the total window over the full-length nucleotide sequence; in particular, nucleotides 121-366 of LOC147007 are identical to nucleotides 316-561 of C17orf32.
Thus, the 126 aa protein XP_097165 encoded by LOC147007 exhibits overall homology to the 208 aa protein encoded by C17orf32, with 50% matching in 48% of the total window over the full-length protein; in particular, residues 23-104 of XP_097165 are identical to residues 96-177 of the C17orf32 protein. Both flanking regions of LOC147007, outside the shared central part of the ORF, result from incorrect assembly of unrelated cDNA. In addition, we have cloned in silico a novel mouse gene, ORF32 (open reading frame 32), with TPA accession number BK000258, which is the mouse ortholog of human C17orf32. Our strategy is helpful both for finding more novel human genes and for correcting the errors in the REFSEQs issued by the NCBI genome annotation project. For example, gene prediction by automatic calculation and analysis yielded two model reference sequences (LOC124919 and LOC147007) from NCBI contig NT_010808. Both should be C17orf32, but in fact both are different erroneous forms of C17orf32, representing the first and second types of error, respectively. As another example, gene prediction by automatic calculation and analysis yielded three model reference sequences (LOC14907, LOC200084 and LOC91126) from NCBI contig NT_004511. All three really correspond to the single gene ZNF362 but were submitted as three different erroneous forms of ZNF362, representing the fourth, fifth, and seventh types of error, respectively. Erroneous human genome coding sequences of this kind can be corrected or avoided by combining in silico cloning with experimental verification. Users should treat computer annotation with caution, because it may contain all of these types of erroneous human genome coding sequences. The correct identification and annotation of novel human genes remains a long and arduous task.
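
The frameshift described in example (1) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the 18-nucleotide sequence, the insertion position, and the helper names `translate` and `insert_base` are all hypothetical and chosen only to show how a single inserted G changes every downstream codon.

```python
# Toy illustration of a reading frame shift caused by one inserted base.
# The sequence below is made up; it is NOT C17orf32 or any LOC sequence.

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, built in the conventional TCAG codon order.
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)


def translate(cdna: str) -> str:
    """Translate from the first base until a stop codon or the last full codon."""
    protein = []
    for i in range(0, len(cdna) - 2, 3):
        residue = CODON_TABLE[cdna[i:i + 3]]
        if residue == "*":  # stop codon ends translation
            break
        protein.append(residue)
    return "".join(protein)


def insert_base(cdna: str, after_pos: int, base: str) -> str:
    """Insert `base` after the 1-based position `after_pos`."""
    return cdna[:after_pos] + base + cdna[after_pos:]


REFERENCE = "ATGGCTGAAACCTTGTAA"            # hypothetical verified cDNA
ANNOTATED = insert_base(REFERENCE, 6, "G")  # hypothetical erroneous model

print(translate(REFERENCE))  # MAETL
print(translate(ANNOTATED))  # MAGNLV
```

The two proteins agree only up to the insertion point (`MA`); after it, every codon is read in a different frame and the original stop codon is lost, which is the kind of alternative protein the abstract attributes to the redundant G.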

[Correction of Five Different Types of Errors of Model REFSEQs Appeared in NCBI Human Gene Database Only by Using Two Novel Human Genes C17orf32 and ZNF362].
