Abstract:

Background: Software for labeling biological sequences typically produces a theory-based statistic for each match (the E-value) that indicates the likelihood of seeing that match's score by chance. E-values accurately predict the false match rate for comparisons of random (shuffled) sequences, and thus provide a reasoned mechanism for setting score thresholds that enable high sensitivity with a low expected false match rate. This threshold-setting strategy is challenged by real biological sequences, which contain regions of local repetition and low sequence complexity that cause excess matches between non-homologous sequences. Knowing this, tool developers often build benchmarks that use realistic-seeming decoy sequences to explore empirical tradeoffs between sensitivity and false match rate. A recent trend has been to employ reversed biological sequences as realistic decoys, because these preserve the distribution of letters and the existence of local repeats while disrupting the original sequence's functional properties. However, we and others have observed that sequences produce high-scoring alignments to their own reversals with surprising frequency, leading to an overstatement of false match risk that may negatively affect downstream analysis.

Results: We demonstrate that an alignment between a sequence S and its (possibly mutated) reversal tends to produce higher scores than an alignment between truly unrelated sequences, even when S is a shuffled string with no notable repetitive or low-complexity regions. This phenomenon is due to the unintuitive fact that sequences (even randomly shuffled ones) contain palindromes that are on average longer than the longest common substrings (LCS) shared between permuted variants of the same sequence. Though the expected palindrome length is only slightly larger than the expected LCS length, the distribution of alignment scores involving reversed sequences is strongly right-shifted, leading to a greatly increased frequency of high-scoring alignments to reversed sequences.

Impact: Overestimates of false match risk can motivate unnecessarily high score thresholds, potentially reducing true match sensitivity. Also, when tool sensitivity is reported only up to the score of the first matched decoy sequence, a large decoy set consisting of reversed sequences can obscure sensitivity differences between tools. Based on these observations, we advise that reversed biological sequences be used as decoys only when care is taken to remove positive matches in the original (un-reversed) sequences, or when overstatement of false labeling is not a concern. Though the primary focus of the analysis is on sequence annotation, we also demonstrate that the prevalence of internal palindromes may lead to an overstatement of the rate of false labels in protein identification with mass spectrometry.
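The comparison at the heart of the Results can be explored numerically with a small simulation. The following is a minimal sketch, not the authors' method: it considers only exact palindromes and exact common substrings (the paper concerns approximate palindromes and alignment scores), and the function names, the DNA alphabet, the sequence length, and the trial count are all illustrative assumptions.

```python
import random


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b.

    Standard O(len(a) * len(b)) dynamic program over match-run lengths.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0] * (len(b) + 1)
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                # Extend the run of matches ending at the previous positions.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best:
                    best = curr[j]
        prev = curr
    return best


def longest_palindrome(s: str) -> int:
    """Length of the longest palindromic substring of s (expand around centers)."""
    best = 0
    n = len(s)
    # 2n - 1 centers: each letter (odd length) and each gap (even length).
    for center in range(2 * n - 1):
        lo, hi = center // 2, (center + 1) // 2
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            lo -= 1
            hi += 1
        best = max(best, hi - lo - 1)
    return best


def compare(length=300, trials=50, alphabet="ACGT", seed=0):
    """Return (mean longest palindrome, mean LCS with a shuffled copy).

    All parameter defaults are assumptions chosen for a quick run.
    """
    rng = random.Random(seed)
    pal_total = 0
    lcs_total = 0
    for _ in range(trials):
        letters = [rng.choice(alphabet) for _ in range(length)]
        s = "".join(letters)
        rng.shuffle(letters)  # permuted variant with the same composition
        shuffled = "".join(letters)
        pal_total += longest_palindrome(s)
        lcs_total += longest_common_substring(s, shuffled)
    return pal_total / trials, lcs_total / trials


if __name__ == "__main__":
    mean_pal, mean_lcs = compare()
    print(f"mean longest palindrome: {mean_pal:.2f}")
    print(f"mean LCS vs shuffled copy: {mean_lcs:.2f}")
```

A palindrome in S appears as an exact common substring between S and its reversal, so the first quantity is a lower bound on the longest exact match available to an alignment against the reversed sequence. The second is the corresponding match against a permuted decoy.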

WAS IT A MATch I SAW? Approximate palindromes lead to overstated false match rates in benchmarks using reversed sequences
