Detecting Positively Selected Sites from Amino Acid Sequences: an Implicit Codon Model

Zheng Ouyang, Jie Liang
DOI: https://doi.org/10.1109/iembs.2007.4353538
2007-01-01
Abstract: Fixation of advantageous mutations is an important evolutionary force driving accelerated protein diversification. However, the standard phylogenetic approach to inferring positive selection is based on the relative rate of nonsynonymous to synonymous substitutions and requires knowledge of the DNA sequences, which precludes its application to families of remotely related sequences where substitutions are saturated. In this study, we develop a new method to detect positive selection directly from amino acid sequences by treating codon usage as hidden parameters. For a given set of amino acid sequences and a phylogenetic tree, we use a reversible continuous-time Markov process as our evolutionary model. This model has fewer parameters than a typical amino acid evolutionary model: only the transition/transversion rate ratio, the nonsynonymous/synonymous rate ratio (omega = dN/dS), and codon usage. As in earlier work, we assume that omega is a random variable that takes one of a set of discrete values, each with its own probability. Values with omega > 1 model sites under positive selection. We use a Bayesian Monte Carlo method to estimate model parameters, as it allows the implementation of complex models of sequence evolution. Here, unobserved DNA sequences are sampled from protein sequences according to distributions parametrized by codon usage, relying on the fact that protein sequences and the native protein-encoding DNA sequences share the same phylogenetic tree. The objective is that the sampled DNA sequences should fit this phylogenetic tree as well as the native DNA sequences do. A data set of vertebrate beta-globin sequences is used to verify our model. We are able to detect all eight positively selected sites that were originally reported using the native nucleotide sequences. Our work shows that although the nonsynonymous/synonymous rate ratio is defined at the codon level, it can be used to detect selective pressure on amino acid sequences through our implicit codon-based model.
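To make the sampling step concrete, the following is a minimal Python sketch of drawing one candidate DNA sequence from a protein sequence, with each codon chosen among the synonymous codons of its residue in proportion to a codon usage weight. It is an illustration only, not the authors' implementation: the function names `sample_codons` and `translate`, the uniform usage weights, and the example peptide are all assumptions, and the sketch omits the phylogenetic tree and the Bayesian Monte Carlo machinery that the abstract describes.

```python
import random

# Standard genetic code in the conventional TCAG ordering ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def sample_codons(protein, codon_usage, rng):
    """Draw one codon per residue, weighted by codon usage (hypothetical helper).

    codon_usage maps codon -> non-negative weight; weights are only compared
    among codons that encode the same amino acid.
    """
    dna = []
    for aa in protein:
        synonymous = [c for c, a in CODON_TABLE.items() if a == aa]
        if not synonymous:
            raise ValueError(f"unknown amino acid: {aa!r}")
        weights = [codon_usage.get(c, 0.0) for c in synonymous]
        dna.append(rng.choices(synonymous, weights=weights, k=1)[0])
    return "".join(dna)


def translate(dna):
    """Translate a DNA string back to protein, three bases at a time."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed uniform usage over the 61 sense codons; real weights would be model parameters.
    usage = {c: 1.0 for c, a in CODON_TABLE.items() if a != "*"}
    peptide = "MVHLTPEEK"  # illustrative peptide, not data from the paper
    dna = sample_codons(peptide, usage, rng)
    print(dna, translate(dna) == peptide)
```

Any sequence drawn this way translates back to the input protein by construction; in a full method, such samples would then be scored under the codon substitution model on the given tree.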