Measuring, visualizing, and diagnosing reference bias with biastools

Mao-Jan Lin, Sheila Iyer, Nae-Chyun Chen, Ben Langmead
DOI: https://doi.org/10.1186/s13059-024-03240-8
IF: 17.906
2024-04-21
Genome Biology
Abstract: Many bioinformatics methods seek to reduce reference bias, but no methods exist to comprehensively measure it. Biastools analyzes and categorizes instances of reference bias. It works in various scenarios: when the donor's variants are known and reads are simulated; when donor variants are known and reads are real; and when variants are unknown and reads are real. Using biastools, we observe that more inclusive graph genomes result in fewer biased sites. We find that end-to-end alignment reduces bias at indels relative to local aligners. Finally, we use biastools to characterize how T2T references improve large-scale bias.
genetics & heredity, biotechnology & applied microbiology
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the problem of **reference bias**. In bioinformatics analysis, sequencing reads are typically aligned to a reference genome. Alignment tools often struggle to correctly align reads that carry non-reference alleles, which can produce biased and erroneous measurements.

### Background

- **Definition of Reference Bias**: Reference bias arises when, because of the limitations of the reference genome, reads containing non-reference alleles are aligned incorrectly or not aligned at all.
- **Limitations of Existing Methods**: Although many tools attempt to reduce reference bias, comprehensive methods for measuring and diagnosing it are lacking.
- **Solution**: The authors propose a tool called **Biastools** for measuring, visualizing, and diagnosing reference bias. It works in three scenarios: known donor variants with simulated reads, known donor variants with real reads, and unknown variants with real reads.

### Main Findings

- **Effectiveness of Graph Genomes**: Graph genomes that include more variants reduce the number of sites with reference bias.
- **Impact of Alignment Modes**: End-to-end alignment (such as the default mode of Bowtie 2) reduces reference bias at insertion and deletion sites, whereas local alignment (which allows soft clipping, as BWA-MEM does by default) exhibits more bias at these sites.
- **Advantages of T2T Reference Genome**: Using the T2T-CHM13 assembly in combination with the GRCh38 assembly can reduce large-scale reference bias.

### Methods

- **Simulation Experiments**: Simulate sequencing reads, then align them with different alignment tools and reference genomes to compare reference bias across methods.
- **Balance Measurement**: Measure three types of allele balance, namely Simulated Balance (SB), Mapped Balance (MB), and Assigned Balance (AB), and classify reference bias events based on these balance values.
- **Application to Real Data**: Apply Biastools to real sequencing data to predict which sites are affected by reference bias, and evaluate the performance of the classifier.

### Conclusion

- **Effectiveness of Biastools**: Biastools can measure and diagnose reference bias, helping researchers understand the sources of bias in the alignment process.
- **Future Directions**: Further optimize the tool to improve its performance on real data, particularly in accurately handling insertion and deletion sites.

In summary, this paper introduces Biastools, a tool for measuring and diagnosing reference bias that can improve the accuracy and reliability of bioinformatics analysis.
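
As a rough illustration of the balance measurements described under Methods, the sketch below computes SB, MB, and AB for a single heterozygous site and compares them. It is not the Biastools API. It assumes that allele balance is the fraction of reads supporting the reference allele, and every name in it (`AlleleCounts`, `balance_shifts`, `is_flagged`) and the 0.1 threshold are hypothetical choices made for this example. The flagging rule is a placeholder, not the classifier used in the paper.

```python
from dataclasses import dataclass


@dataclass
class AlleleCounts:
    """Reads supporting the reference and alternate allele at one site."""
    ref: int
    alt: int

    def balance(self) -> float:
        """Assumed definition: fraction of reads carrying the reference allele."""
        total = self.ref + self.alt
        if total == 0:
            raise ValueError("no reads cover this site")
        return self.ref / total


def balance_shifts(simulated: AlleleCounts,
                   mapped: AlleleCounts,
                   assigned: AlleleCounts) -> dict:
    """Return SB, MB, AB and how far MB and AB drift from SB."""
    sb = simulated.balance()
    mb = mapped.balance()
    ab = assigned.balance()
    return {"SB": sb, "MB": mb, "AB": ab, "MB-SB": mb - sb, "AB-SB": ab - sb}


def is_flagged(shifts: dict, threshold: float = 0.1) -> bool:
    """Hypothetical rule: flag a site whose mapped or assigned balance
    moves toward the reference allele by more than `threshold`."""
    return shifts["MB-SB"] > threshold or shifts["AB-SB"] > threshold


# Made-up counts for one heterozygous site.
simulated = AlleleCounts(ref=10, alt=10)  # reads generated from each haplotype
mapped = AlleleCounts(ref=10, alt=8)      # reads that aligned over the site
assigned = AlleleCounts(ref=10, alt=6)    # reads assigned to an allele

shifts = balance_shifts(simulated, mapped, assigned)
flagged = is_flagged(shifts)
```

With these made-up counts, the simulated reads are perfectly balanced, some alternate-allele reads are lost at the mapping step, and more are lost at the assignment step, so the drift from SB grows from MB to AB. Comparing the two drifts is one way to see at which stage the imbalance appears.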