Computer Science Research Seminar

Wednesday, September 27th, 2006

John Torquato

Master thesis defense
Computer Science Department
Hofstra University



A multiple sequence alignment (MSA) juxtaposes three or more sequences for the purpose of extracting commonalities and differences. In bioinformatics, MSAs are used to analyze nucleotide and protein sequences. The various algorithms for generating alignments of multiple biological sequences are guided by a variety of objective functions, such as Sum-of-Pairs (SP), consensus alignment error, and sum of edge distances. Optimizing any of the objective functions entails computational and biological complexities that intensify as three input parameters increase: sequence number, sequence length, and sequence distance. The goodness of the MSA is measured by a scoring matrix associated with the objective function. When the problem complexity is low, the computational and biological meaning of the score are tightly coupled. As the problem complexity increases, the biological value of the MSA score tends to degrade. Existing MSA methods are most often enhanced by heuristics that either speedup an underlying algorithm or import external information ad hoc. For example, shortest-path approaches driven by SP objectives reduce computation time by bounding the search space. To enhance MSAs at a higher level, primary emphasis should be placed on incorporating biological characters into the design of an objective function. The task of developing efficient solutions that are computationally feasible is subordinate to the larger goal of defining a biologically enhanced objective function. Biological characters for protein sequences are derived from attributes residing at three major levels: the primary sequence, underlying genetic code, and higher-level structure. The degree of presence or absence of biological characters among a set of protein sequences specify a phylogenetic guide tree. Using a phylogenetically correct guide tree, it will be possible to construct MSAs that intrinsically encode more biological information. The effort to understand and specify the topology of this tree is the principal driving force for this thesis.