Construction of standard human transcript dataset based on RefSeq and human genome sequence database

Zhi-Feng Li, Yu-Jian Li, Dong-Sheng Zhao, Xing-Yi Hang, Zheng-Zhi Wang, Zhi-Gang Luo, Cheng-Gang Zhang
DOI: https://doi.org/10.3321/j.issn:0253-9772.2006.03.014
2006-01-01
Abstract: The NCBI Reference Sequence (RefSeq) database aims to provide a biologically non-redundant collection of DNA, RNA, and protein sequences and to support research on the genes and proteins of humans and other species. However, because polymorphisms are widespread and experimental quality control differs between laboratories, the RefSeq database contains potential problems that need to be identified. To address this, we define the concept of a standard transcript, based on the central dogma of molecular biology: each standard transcript should map perfectly to the standard genomic DNA sequence at the exon level. We performed a large-scale analysis that mapped all human RefSeq records (2005-4-18) to the officially released human genome sequence database (2005-4-20) using BLAT, Sim4, and EIparser, an in-house program designed specifically for this purpose. Standard transcripts were then derived from the RefSeq database according to their alignment with the standard human genome database. A total of 9,771 human RefSeq records labeled "NM_" or "NR_" could be perfectly mapped to the human genome sequence, while another 10,943 records could be considered standard transcripts after reasonable revision against the genome sequence, as supported by all three methods. The remaining 203 unrevisable records and 2,676 inconsistent records reported by these programs could not be considered standard transcripts and should be checked critically before use because they may contain errors. Our study thus provides a high-quality reference standard dataset of human transcripts for further bioinformatic and experimental analysis, such as studies of polymorphism and mutation in human genes. The reference standard dataset based on the above criteria can be retrieved from http://biocompute.bmi.ac.cn/transcriptome/index.htm.
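The exon-level criterion in the abstract can be illustrated with a minimal Python sketch. It is not the authors' EIparser, nor BLAT or Sim4. The function names, the 0-based half-open exon coordinates, the forward-strand assumption, and the toy sequences are all hypothetical choices made for illustration. The sketch checks whether a transcript equals the concatenation of its exons on the genome and, if not, lists the positions that differ, which is the kind of information a revision step would need.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # assumed convention: 0-based, half-open (start, end) on the forward strand


def spliced_exons(genome: str, exons: List[Exon]) -> str:
    """Concatenate the genomic sequence of each exon, in the order given."""
    return "".join(genome[start:end] for start, end in exons)


def is_perfectly_mapped(transcript: str, genome: str, exons: List[Exon]) -> bool:
    """True if the transcript is identical to its spliced exons (case-insensitive)."""
    return transcript.upper() == spliced_exons(genome, exons).upper()


def mismatch_positions(transcript: str, genome: str, exons: List[Exon]) -> List[int]:
    """Return transcript positions (0-based) that differ from the spliced exons.

    Only substitutions are handled; a length difference (insertion or deletion)
    raises ValueError, since resolving it would need a real aligner.
    """
    spliced = spliced_exons(genome, exons).upper()
    query = transcript.upper()
    if len(query) != len(spliced):
        raise ValueError("length differs from spliced exons; an aligner is needed")
    return [i for i, (a, b) in enumerate(zip(query, spliced)) if a != b]


# Toy data (hypothetical): two exons separated by an intron.
genome = "ACGTAAGGTTTTCAGCCGGA"
exons = [(0, 5), (15, 20)]

print(is_perfectly_mapped("ACGTACCGGA", genome, exons))  # perfect exon-level match
print(mismatch_positions("ACGTACCGTA", genome, exons))   # one substitution
```

In this toy case the first transcript satisfies the perfect-mapping criterion, while the second differs from the genome at a single position and would be a candidate for revision against the genomic sequence.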