Metagenomic coverage bias at transcription start sites is correlated with gene expression

Gordon Qian, Izaak Coleman, Tal Korem, Joshua W.K. Ho
DOI: https://doi.org/10.1101/2024.05.09.593333
2024-05-13
Abstract: Metagenomic sequencing is presumed to provide unbiased sampling of all the genetic material in a sample. Downstream analysis methods, such as binning, gene copy number analysis, structural variation detection, or single nucleotide polymorphism analysis, commonly assume an even distribution of coverage across the genome after accounting for known artefacts such as GC content. We discovered a coverage bias across gut microbiome species, manifesting as a difference in coverage before and after bacterial transcription start sites (TSSs). Using matched metatranscriptomic and metagenomic sequencing data, we demonstrate that this bias correlates with gene expression. Potential artefacts such as the sequencing technology, the reference genome used for alignment, and mappability bias were investigated across multiple datasets and shown not to account for this association. While GC bias was found to be correlated with coverage bias, the association of coverage bias with gene expression remains significant after adjusting for GC bias. Paired-end read mapping demonstrated an enrichment of 5’ read ends immediately downstream of the TSS, which was partly a byproduct of unmapped reads upstream of the TSS. Our observations suggest the existence of strain-level variation, where sequence variation in the promoter region prevents proper read alignment to the reference genome. The correlation of this phenomenon with gene expression may also reflect evolutionary footprints of fine-tuning the regulation of gene expression. Understanding the source of this sequence variation and the biological implications of this artefact will be useful not only to better characterise microbial functions but also to improve interpretations of strain-level dynamics.
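To make the quantities in the abstract concrete, the following is a minimal Python sketch of one way a per-gene TSS coverage bias could be computed and then related to expression while adjusting for GC content. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not give the exact definition, so the log2 downstream/upstream ratio, the 50 bp window, the pseudocount, the use of a first-order partial correlation for the GC adjustment, and the function names `tss_coverage_bias`, `pearson`, and `partial_correlation` are all hypothetical choices.

```python
import math


def tss_coverage_bias(coverage, tss, strand="+", window=50, pseudocount=1.0):
    """Log2 ratio of mean coverage downstream vs upstream of a TSS.

    coverage: per-base read depth along one contig (list of numbers).
    tss: 0-based index of the transcription start site.
    window, pseudocount: illustrative defaults, not values from the paper.
    """
    if strand == "+":
        upstream = coverage[max(0, tss - window):tss]
        downstream = coverage[tss:tss + window]
    else:
        # On the minus strand, transcription runs toward lower coordinates.
        upstream = coverage[tss + 1:tss + 1 + window]
        downstream = coverage[max(0, tss + 1 - window):tss + 1]
    if not upstream or not downstream:
        raise ValueError("window falls outside the coverage array")
    mean_up = sum(upstream) / len(upstream)
    mean_down = sum(downstream) / len(downstream)
    return math.log2((mean_down + pseudocount) / (mean_up + pseudocount))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z.

    Here x would be per-gene coverage bias, y gene expression, and z a
    GC-content covariate (an assumed stand-in for the GC adjustment).
    """
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
```

A positive value from `tss_coverage_bias` means coverage is higher downstream of the TSS than upstream. Passing the per-gene biases, expression values, and GC values to `partial_correlation` gives the bias-expression association with the linear GC effect removed; a rank-based or regression-based adjustment would be an equally reasonable alternative.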
Bioinformatics