A New Approach for Multiple Sequence Alignment

Xu Zhang, Tamer Kahveci
2005-01-01
Abstract: We introduce a new multiple sequence alignment method for protein sequences. We name our method HSA (Horizontal Sequence Alignment) because it slides a window horizontally across all of the protein sequences simultaneously. HSA is superior to existing methods whose results depend on the order of the proteins, since it considers all the proteins at once. Unlike most existing multiple alignment methods, HSA takes secondary structure information into account to find a biologically relevant alignment. HSA uses a scoring matrix, such as BLOSUM62, to capture the substitution probabilities of amino acids. HSA runs in four steps:

Step 1: (Initialization) We start by building a directed graph from the input proteins as follows. Each residue maps to a vertex in the graph. If it is available, the Secondary Structure Element (SSE) type (α-helix, β-sheet) of each residue is stored along with the vertex. A directed edge from vertex i to vertex j is added if residue j immediately follows residue i in the same sequence, or if residues i and j have a substitution score higher than a given threshold. A weight is assigned to each edge based on the substitution score and SSE type. If two residues belong to the same SSE type, we assign a larger edge weight. All sequences are then scanned to find fragments with known SSE types. These fragments guide the alignment later. The fragments are then clustered into groups, where each group consists of one fragment from every sequence, if they satisfy the following four criteria: 1) They have the same SSE type. 2) They have a similar number of residues. 3) Their positions in the original sequences are close. 4) The substitution score for every fragment pair is greater than a given threshold.

Step 2: (Pre-alignment Adjustment) The graph constructed in Step 1 is adjusted by inserting gap vertices as follows. The number of residues in each fragment and the number of residues between consecutive fragments are calculated first. The count of gap vertices is then computed as a function of these two numbers. For each sequence, gap vertices are inserted between consecutive fragments to bring the fragments within the same group together. This pre-alignment adjustment moves similar fragments vertically closer to each other, so they have a higher probability of being aligned together in the next step.

Step 3: (Alignment) In this step, the sequences are actually aligned. We start by placing a window of length w at the beginning of each sequence. Typically we use w = 4 or 6. This window defines a subgraph of the graph constructed in Step 2. Next, we greedily choose the clique with the best expectation score from this subgraph. A clique here is defined as a complete subgraph of the graph with the constraint that it consists of one vertex from each sequence. In other words, if K sequences are to be aligned, a clique corresponds to the alignment of one letter from each of the K sequences. The score of a clique is defined as the SP (Sum-of-Pairs) score of the corresponding column. The expectation score is computed as follows. For each clique, we align the letters of that clique and iteratively find the next best clique that 1) does not conflict with this clique, and 2) has at least one letter next to a letter in this clique. This iteration is repeated t times to find t columns. Typically, t = 4. These t cliques define a local alignment of the input sequences. The expectation score of the original clique is defined as the SP score of this local alignment. We then slide the window by one position and repeat the same process until it reaches the end of the sequences.

Step 4: (Post-alignment Adjustment) In this step, the alignment obtained in the previous step is adjusted by examining the gaps. After the columns are concatenated, many short gaps may be scattered through the sequences. Rearranging the gaps may therefore be required to construct fewer but longer gaps. Sequences are scanned again to find
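The graph construction described in Step 1 can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `toy_score`, `build_graph`, `threshold`, and `sse_bonus` are hypothetical, the toy scoring function stands in for a real matrix such as BLOSUM62, and both the zero weight on successor edges and the direction of similarity edges (from the lower-indexed sequence to the higher-indexed one) are assumptions, since the abstract does not specify them.

```python
from itertools import combinations

def toy_score(a, b):
    # Hypothetical stand-in for a substitution matrix such as BLOSUM62.
    return 4 if a == b else -1

def build_graph(seqs, sses, threshold=0, sse_bonus=2):
    """Build a Step 1 style graph.

    seqs: list of residue strings.
    sses: list of SSE label strings (same lengths as seqs), or None entries
          when no secondary structure is available for a sequence.
    Returns (vertices, edges) where
      vertices maps (seq_idx, pos) -> (residue, sse_label_or_None)
      edges maps (from_vertex, to_vertex) -> weight
    """
    vertices = {}
    edges = {}
    for s, (seq, sse) in enumerate(zip(seqs, sses)):
        for p, residue in enumerate(seq):
            vertices[(s, p)] = (residue, sse[p] if sse else None)
        # Successor edges: residue p + 1 immediately follows residue p.
        for p in range(len(seq) - 1):
            edges[((s, p), (s, p + 1))] = 0  # assumed weight, not from the source

    # Similarity edges between residues of different sequences.
    for (u, (ru, su)), (v, (rv, sv)) in combinations(vertices.items(), 2):
        if u[0] == v[0]:
            continue  # same sequence: only successor edges apply
        score = toy_score(ru, rv)
        if score > threshold:
            # Residues with the same known SSE type get a larger weight.
            bonus = sse_bonus if su is not None and su == sv else 0
            edges[(u, v)] = score + bonus
    return vertices, edges

vertices, edges = build_graph(["AC", "AG"], ["HH", "HE"])
```

In this toy input, the only similarity edge joins the two `A` residues, and its weight includes the SSE bonus because both are labeled `H`.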
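The clique selection in Step 3 can likewise be sketched in Python. This is a simplified illustration under stated assumptions, not the method as implemented by the authors: it ranks cliques by the SP score of a single column and omits the t-column lookahead that defines the expectation score. The names `toy_score`, `sp_score`, `cliques_in_window`, and `best_clique` are hypothetical, and the toy scoring function again stands in for a matrix such as BLOSUM62.

```python
from itertools import combinations, product

def toy_score(a, b):
    # Hypothetical stand-in for a substitution matrix such as BLOSUM62.
    return 4 if a == b else -1

def sp_score(column, score=toy_score):
    """Sum-of-Pairs score: sum of pairwise scores over all letters in a column."""
    return sum(score(a, b) for a, b in combinations(column, 2))

def cliques_in_window(seqs, start, w, threshold=0, score=toy_score):
    """Yield (positions, sp) for each clique in the window [start, start + w).

    A clique picks one position per sequence, and every pair of picked
    residues must score above the threshold (that is, be joined by an edge).
    """
    ranges = [range(start, min(start + w, len(s))) for s in seqs]
    for positions in product(*ranges):
        column = [s[p] for s, p in zip(seqs, positions)]
        if all(score(a, b) > threshold for a, b in combinations(column, 2)):
            yield positions, sp_score(column, score)

def best_clique(seqs, start, w, **kwargs):
    """Return the clique with the highest SP score, or None if none exists."""
    return max(cliques_in_window(seqs, start, w, **kwargs),
               key=lambda clique: clique[1], default=None)

seqs = ["ACDE", "CADE", "AADE"]
print(best_clique(seqs, start=0, w=2))
```

Enumerating every combination of window positions costs on the order of w^K for K sequences, which is one reason a small window such as w = 4 or 6 keeps the search tractable.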