Abstract:

Background: A complete and accurate human reference genome is important for functional genomics research. An incomplete reference genome and individual-specific sequences therefore have significant effects on various studies.

Results: We used two RNA-Seq datasets, from human brain tissues and from 10 mixed cell lines, to investigate the completeness of the human reference genome. First, we demonstrated that, of the previously identified ~5 Mb of Asian and ~5 Mb of African novel sequences that are absent from the human reference genome (NCBI build 36), ~211 kb and ~201 kb, respectively, could be transcribed. Our results suggest that many of those transcribed regions are not specific to Asian and African individuals but are also present in Caucasian individuals. Next, we found that 104 RefSeq genes that cannot be aligned to NCBI build 37 are expressed above 0.1 RPKM in brain and cell lines. Of these, 55 are conserved across human, chimpanzee and macaque, suggesting that a significant number of functional human genes are still absent from the human reference genome. Moreover, we identified hundreds of novel transcript contigs that cannot be aligned to NCBI build 37, RefSeq genes, or EST sequences. Some of those novel transcript contigs are also conserved among human, chimpanzee and macaque. By positioning those contigs onto the human genome, we identified several large deletions in the reference genome. Several conserved novel transcript contigs were further validated by RT-PCR.

Conclusion: Our findings demonstrate that a significant number of genes are still absent from the incomplete human reference genome, highlighting the importance of further refining the human reference genome and curating those missing genes. Our study also shows the importance of de novo transcriptome assembly. A transcriptome-based comparison between the reference genome and other related human genomes provides an alternative way to refine the human reference genome.
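The 0.1 RPKM expression cutoff mentioned in the abstract can be illustrated with a minimal sketch. It uses the standard definition of RPKM (reads per kilobase of transcript per million mapped reads). The function names `rpkm` and `is_expressed` and the read counts in the example are hypothetical; they are not taken from the study, and the study's actual pipeline is not described here.

```python
# Minimal sketch of the standard RPKM formula and a 0.1 RPKM cutoff.
# All names and example numbers below are illustrative assumptions.

def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # reads / (length in kb) / (mapped reads in millions)
    # = reads * 1e9 / (length in bp * mapped reads)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def is_expressed(read_count: int, gene_length_bp: int,
                 total_mapped_reads: int, threshold: float = 0.1) -> bool:
    """Return True when the gene's RPKM is strictly above the threshold."""
    return rpkm(read_count, gene_length_bp, total_mapped_reads) > threshold


if __name__ == "__main__":
    # Hypothetical gene: 50 reads on a 2,000 bp transcript, 20 million mapped reads.
    value = rpkm(50, 2_000, 20_000_000)
    print(f"RPKM = {value:.2f}, expressed: {is_expressed(50, 2_000, 20_000_000)}")
```

Because RPKM normalizes by both transcript length and sequencing depth, the same cutoff can be applied to datasets of different sizes, such as the brain and cell-line libraries.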

Graph pangenome reveals functional, evolutionary, and phenotypic significance of human nonreference sequences

Human pangenome analysis of sequences missing from the reference genome reveals their widespread evolutionary, phenotypic, and functional roles

Pig Pangenome Graph Reveals Functional Features of Non-Reference Sequences

De novo genome assemblies from two indigenous Americans from Arizona identify new polymorphisms in non-reference sequences

Natural Selection and Functional Potentials of Human Noncoding Elements Revealed by Analysis of Next Generation Sequencing Data

Comparative and Functional Genomic Resource for Mechanistic Studies of Human Blood Pressure–Associated Single Nucleotide Polymorphisms

Characterizing the Genetic Polymorphisms in 370 Challenging Medically Relevant Genes Using Long-Read Sequencing Data from 41 Human Individuals among 19 Global Populations

PGG.SNV: Understanding the Evolutionary and Medical Implications of Human Single Nucleotide Variations in Diverse Populations

Revealing the missing expressed genes beyond the human reference genome by RNA-Seq

A harmonized public resource of deeply sequenced diverse human genomes

Pan-African genome demonstrates how population-specific genome graphs improve high-throughput sequencing data analysis

Long-read sequencing of 945 Han individuals identifies novel structural variants associated with phenotypic diversity and disease susceptibility

Unveiling novel genetic variants in 370 challenging medically relevant genes using the long read sequencing data of 41 samples from 19 global populations

Assembly of a pan-genome from deep sequencing of 910 humans of African descent

Worldwide DNA Sequence Variation in a 10-Kilobase Noncoding Region on Human Chromosome 22

Complex genetic variation in nearly complete human genomes

Integrating common and rare genetic variation in diverse human populations

An integrated map of genetic variation from 1,092 human genomes

A pan-tissue, pan-disease compendium of human orphan genes

Comprehensively identifying and characterizing the missing gene sequences in human reference genome with integrated analytic approaches