Alignment Metric Accuracy

Ariel S. Schwartz, Eugene W. Myers, Lior Pachter
DOI: https://doi.org/10.48550/arXiv.q-bio/0510052
2005-10-28
Abstract: We propose a metric for the space of multiple sequence alignments that can be used to compare two alignments to each other. In the case where one of the alignments is a reference alignment, the resulting accuracy measure improves upon previous approaches, and provides a balanced assessment of the fidelity of both matches and gaps. Furthermore, in the case where a reference alignment is not available, we provide empirical evidence that the distance from an alignment produced by one program to predicted alignments from other programs can be used as a control for multiple alignment experiments. In particular, we show that low accuracy alignments can be effectively identified and discarded. We also show that in the case of pairwise sequence alignment, it is possible to find an alignment that maximizes the expected value of our accuracy measure. Unlike previous approaches based on expected accuracy alignment that tend to maximize sensitivity at the expense of specificity, our method is able to identify unalignable sequence, thereby increasing overall accuracy. In addition, the algorithm allows for control of the sensitivity/specificity tradeoff via the adjustment of a single parameter. These results are confirmed with simulation studies that show that unalignable regions can be distinguished from homologous, conserved sequences. Finally, we propose an extension of the pairwise alignment method to multiple alignment. Our method, which we call AMAP, outperforms existing protein sequence multiple alignment programs on benchmark datasets. A webserver and software downloads are available at http://bio.math.berkeley.edu/amap/.
Quantitative Methods, Statistics Theory
What problem does this paper attempt to address?
This paper addresses how to measure and improve the accuracy of sequence alignments. Specifically, the authors propose a new metric for comparing two multiple sequence alignments and address the following key issues:

1. **Alignment accuracy evaluation**:
   - A new measure, Alignment Metric Accuracy (AMA), is proposed to quantify the similarity between two alignments.
   - When one of the alignments is a reference alignment, AMA scores the predicted alignment against it and gives a balanced assessment that accounts for the fidelity of both matches and gaps.
2. **Control in the absence of a reference alignment**:
   - When no reference alignment is available, the authors provide empirical evidence that the distance from one program's alignment to the alignments predicted by other programs can serve as a control for multiple alignment experiments.
   - Low-accuracy alignments can be effectively identified and discarded, thereby improving overall alignment quality.
3. **Optimization in pairwise sequence alignment**:
   - For pairwise sequence alignment, the authors show that it is possible to find an alignment that maximizes the expected value of AMA. Unlike earlier expected-accuracy approaches, which tend to maximize sensitivity at the expense of specificity, this method can identify unalignable sequence, which increases overall accuracy.
   - A single parameter (the gap factor) controls the trade-off between sensitivity and specificity.
4. **Extension to multiple sequence alignment**:
   - The pairwise alignment method is extended to multiple sequence alignment, yielding the AMAP algorithm.
   - Experimental results show that AMAP outperforms existing protein multiple sequence alignment programs on benchmark datasets.

### Formula summary

- **Metric definition**:
  \[ d(h_i, h_j) = n + m - 2\,| h_i^H \cap h_j^H | - | h_i^I \cap h_j^I | - | h_i^D \cap h_j^D | \]
  where \( n \) and \( m \) are the lengths of the two sequences, and \( h_i^H \), \( h_i^I \) and \( h_i^D \) represent the sets of matching pairs, insertions and deletions in alignment \( h_i \), respectively.
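- **AMA definition**:
  \[ g(h_p, h_r) = 1 - \frac{d(h_p, h_r)}{n + m} \]
  where \( h_p \) is the predicted alignment and \( h_r \) is the reference alignment.
- **AMAP algorithm objective function**:
  \[ h_p = \arg\max_{h \in A_{n,m}} \left( \sum_{(i,j) \in h^H} P(\sigma_1^i \diamond \sigma_2^j \mid \sigma_1, \sigma_2, \theta) + G_f \sum_{i \in h^D} P(\sigma_1^i \diamond - \mid \sigma_1, \sigma_2, \theta) + G_f \sum_{j \in h^I} P(\sigma_2^j \diamond - \mid \sigma_1, \sigma_2, \theta) \right) \]
  where \( A_{n,m} \) is the set of alignments of two sequences of lengths \( n \) and \( m \), \( G_f \) is the gap factor, and each \( P(\cdot \mid \sigma_1, \sigma_2, \theta) \) is the probability, given the sequences \( \sigma_1, \sigma_2 \) and model parameters \( \theta \), that a position is matched to another position or to a gap.

### Conclusion

By introducing a new metric and an optimization algorithm built on it, this paper aims to improve the accuracy and reliability of existing alignment methods, especially when the input contains unalignable or unrelated sequence. These improvements are relevant to bioinformatics work that depends on alignments, such as genomics and protein structure analysis.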
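To make the metric and AMA definitions from the formula summary concrete, here is a minimal Python sketch for the pairwise case. It is not the authors' implementation. The representation of an alignment as two equal-length gapped strings, the 0-based position indices, and the function names `alignment_sets`, `distance`, and `ama` are all assumptions made for illustration.

```python
def alignment_sets(row1, row2, gap="-"):
    """Split a pairwise alignment (two gapped rows) into the sets H, D, I.

    H: pairs (i, j) of matched positions; D: positions of sequence 1 aligned
    to a gap; I: positions of sequence 2 aligned to a gap. Also returns the
    ungapped lengths n and m. Indices are 0-based (an assumption).
    """
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    matches, deletions, insertions = set(), set(), set()
    i = j = 0
    for a, b in zip(row1, row2):
        if a != gap and b != gap:
            matches.add((i, j))
            i += 1
            j += 1
        elif a != gap:
            deletions.add(i)
            i += 1
        elif b != gap:
            insertions.add(j)
            j += 1
        # columns with a gap in both rows carry no information and are skipped
    return matches, deletions, insertions, i, j


def distance(aln_a, aln_b):
    """d = n + m - 2|H_a & H_b| - |I_a & I_b| - |D_a & D_b|."""
    h_a, d_a, i_a, n, m = alignment_sets(*aln_a)
    h_b, d_b, i_b, n_b, m_b = alignment_sets(*aln_b)
    if (n, m) != (n_b, m_b):
        raise ValueError("alignments must be over the same two sequences")
    return n + m - 2 * len(h_a & h_b) - len(i_a & i_b) - len(d_a & d_b)


def ama(predicted, reference):
    """AMA = 1 - d(predicted, reference) / (n + m)."""
    _, _, _, n, m = alignment_sets(*reference)
    return 1 - distance(predicted, reference) / (n + m)


# Hypothetical example: two alignments of ACGT and AGTC.
reference = ("ACGT-", "A-GTC")
predicted = ("ACGT", "AGTC")
print(distance(predicted, reference))  # 6
print(ama(predicted, reference))       # 0.25
```

The same `distance` function could also be applied between the outputs of different alignment programs, which is the reference-free control described above. Only the first matched pair agrees in the example, so six of the eight positions count against the prediction.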
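The objective function in the formula summary can be maximized for two sequences with a Needleman-Wunsch-style dynamic program. The sketch below follows the objective as summarized here, not the AMAP source code. It assumes the match and gap probabilities have already been computed elsewhere (for example by a pair HMM) and are passed in as plain lists. The names `p_match`, `p_gap1`, `p_gap2`, and `max_expected_accuracy_alignment` are hypothetical.

```python
def max_expected_accuracy_alignment(p_match, p_gap1, p_gap2, gap_factor=1.0):
    """Maximize sum of match probabilities + gap_factor * sum of gap probabilities.

    p_match[i][j]: probability that position i of sequence 1 matches position j
                   of sequence 2 (assumed to be given).
    p_gap1[i], p_gap2[j]: probability that a position is aligned to a gap.
    Returns (matches, deletions, insertions, score) with 0-based indices.
    """
    n, m = len(p_gap1), len(p_gap2)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]

    # First column and row: every position so far is aligned to a gap.
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap_factor * p_gap1[i - 1]
        move[i][0] = "D"
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap_factor * p_gap2[j - 1]
        move[0][j] = "I"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (score[i - 1][j - 1] + p_match[i - 1][j - 1], "H"),
                (score[i - 1][j] + gap_factor * p_gap1[i - 1], "D"),
                (score[i][j - 1] + gap_factor * p_gap2[j - 1], "I"),
            ]
            # On ties, max keeps the first candidate in the list.
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])

    # Trace back from the bottom-right corner to recover the alignment.
    matches, deletions, insertions = set(), set(), set()
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "H":
            matches.add((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "D":
            deletions.add(i - 1)
            i -= 1
        else:
            insertions.add(j - 1)
            j -= 1
    return matches, deletions, insertions, score[n][m]


# Hypothetical probabilities: the first positions match confidently, the second do not.
p_match = [[0.9, 0.0], [0.0, 0.1]]
p_gap1 = [0.1, 0.9]
p_gap2 = [0.1, 0.9]
print(max_expected_accuracy_alignment(p_match, p_gap1, p_gap2, gap_factor=1.0))
print(max_expected_accuracy_alignment(p_match, p_gap1, p_gap2, gap_factor=0.05))
```

With a gap factor of 1 the low-confidence second positions are left unaligned. With a small gap factor gaps earn almost nothing, so the same positions are matched. This is the sensitivity/specificity trade-off that the single parameter controls.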