Abstract: Many bioinformatics methods seek to reduce reference bias, but no methods exist to comprehensively measure it. Biastools analyzes and categorizes instances of reference bias. It works in various scenarios: when the donor's variants are known and reads are simulated; when donor variants are known and reads are real; and when variants are unknown and reads are real. Using biastools, we observe that more inclusive graph genomes result in fewer biased sites. We find that end-to-end alignment reduces bias at indels relative to local aligners. Finally, we use biastools to characterize how T2T references improve large-scale bias.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses **reference bias**. In bioinformatics analysis, sequencing reads are typically aligned to a reference genome. Alignment tools often struggle to correctly align reads that contain non-reference alleles, which leads to biased and erroneous measurements.

### Background

- **Definition of Reference Bias**: Reads containing non-reference alleles are aligned incorrectly, or not aligned at all, because the reference genome does not represent those alleles.
- **Limitations of Existing Methods**: Many tools attempt to reduce reference bias, but there is no comprehensive method for measuring and diagnosing it.
- **Solution**: The authors propose **Biastools**, a tool for measuring, visualizing, and diagnosing reference bias. It works in three scenarios: known donor variants with simulated reads, known donor variants with real reads, and unknown variants with real reads.

### Main Findings

- **Effectiveness of Graph Genomes**: Graph genomes that include more variants reduce the number of sites with reference bias.
- **Impact of Alignment Modes**: End-to-end alignment (such as Bowtie 2's default mode) reduces reference bias at insertion and deletion sites, whereas local alignment, which allows soft clipping (as BWA-MEM does by default), exhibits more bias at these sites.
- **Advantages of T2T Reference Genome**: Using the T2T-CHM13 assembly in combination with the GRCh38 assembly can significantly reduce large-scale reference bias.

### Methods

- **Simulation Experiments**: Simulate sequencing reads, then align them with different alignment tools and reference genomes to compare reference bias across methods.
- **Balance Measurement**: Measure three types of allele balance, Simulated Balance (SB), Mapped Balance (MB), and Assigned Balance (AB), and classify reference bias events based on these values.
- **Application to Real Data**: Apply Biastools to real sequencing data to predict which sites are affected by reference bias, and evaluate the performance of the classifier.

### Conclusion

- **Effectiveness of Biastools**: Biastools can measure and diagnose reference bias, helping researchers understand the sources of bias in the alignment process.
- **Future Directions**: Further optimize the tool to improve its performance on real data, particularly in accurately handling insertion and deletion sites.

In summary, this paper introduces Biastools, a comprehensive approach to measuring and diagnosing reference bias, with the aim of improving the accuracy and reliability of bioinformatics analysis.
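
The balance measurements listed under Methods can be illustrated with a minimal Python sketch. It is not the Biastools implementation. It assumes that each balance is the fraction of allele-informative reads supporting the reference allele, and that a site is flagged when the Assigned Balance drifts from the Simulated Balance by more than a cutoff. The function names, the `threshold` value, and the category labels are all hypothetical and chosen only for illustration.

```python
def allele_balance(ref_count: int, alt_count: int) -> float:
    """Fraction of allele-informative reads that support the reference allele.

    Assumption: balance = REF / (REF + ALT); the summary does not give the formula.
    """
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no allele-informative reads at this site")
    return ref_count / total


def classify_site(sb: float, mb: float, ab: float, threshold: float = 0.1) -> str:
    """Label a site by the stage at which its balance shifts the most.

    sb, mb, ab: Simulated, Mapped, and Assigned Balance for one site.
    threshold:  hypothetical cutoff on |AB - SB|, not a Biastools default.
    The returned labels are illustrative, not the categories used by Biastools.
    """
    if abs(ab - sb) <= threshold:
        return "balanced"
    mapping_shift = abs(mb - sb)      # change introduced when reads are mapped
    assignment_shift = abs(ab - mb)   # change introduced when alleles are assigned
    if mapping_shift >= assignment_shift:
        return "mapping-stage shift"
    return "assignment-stage shift"


# Example: a heterozygous site simulated with 20 REF and 20 ALT reads,
# of which 20 REF and 12 ALT reads map to the site, and 20 REF and 11 ALT
# reads are finally assigned an allele.
sb = allele_balance(20, 20)
mb = allele_balance(20, 12)
ab = allele_balance(20, 11)
label = classify_site(sb, mb, ab)
```

In this example most of the lost ALT evidence disappears at the mapping step, so under the sketch's assumptions the site is labeled as a mapping-stage shift. With real reads the Simulated Balance is unavailable, which is why the summary describes a separate classifier for that setting.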

Measuring, visualizing, and diagnosing reference bias with biastools

Measuring, visualizing and diagnosing reference bias with biastools

Unravelling reference bias in ancient DNA datasets

A Comprehensive Evaluation of Alignment Software for Reduced Representation Bisulfite Sequencing Data

Open-source benchmarking of IBD segment detection methods for biobank-scale cohorts

A test metric for assessing single-cell RNA-seq batch correction

Bias in Estimates of Quantitative-Trait–Locus Effect in Genome Scans: Demonstration of the Phenomenon and a Method-of-Moments Procedure for Reducing Bias

A Basic Tool for Background and Shading Correction of Optical Microscopy Images

Deep-BIAS: Detecting Structural Bias using Explainable AI

Assessment of batch-correction methods for scRNA-seq data with a new test metric

A Method to Correct Systematic Bias in Affymetrix SNP Arrays

Case-specific selection of batch correction methods for integrating single-cell transcriptomic data from different sources

Embracing the informative missingness and silent gene in analyzing biologically diverse samples

Minimizing Reference Bias with an Impute-First Approach

Measuring Risk of Bias in Biomedical Reports: The RoBBR Benchmark

The Biases of Copy Numbers from Affymetrix SNP Arrays and Their Corrections.

ABDS: a bioinformatics tool suite for analyzing biologically diverse samples

The analysis of biases of copy numbers from Affymetrix SNP arrays

Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery

Benchmarking Recent Computational Tools for DNA-binding Protein Identification

Bipol: Multi-axes Evaluation of Bias with Explainability in Benchmark Datasets