Bayesian Structural Phylogenetics
This thesis concerns the use of protein structure to improve phylogenetic inference. Interest in phylogenetics has grown as the number of available DNA and protein sequences continues to increase rapidly and demand from other scientific fields rises. It is now well understood that phylogenies should be inferred jointly with alignment through use of stochastic evolutionary models. It has not been possible, however, to incorporate protein structure in this framework. Protein structure is more strongly conserved than sequence over long evolutionary distances, so an important source of information, particularly for alignment, has been left out of analyses. I present a stochastic process model for the joint evolution of protein primary and tertiary structure, suitable for use in alignment and estimation of phylogeny. Indels arise from a classic Links model and mutations follow a standard substitution matrix, while backbone atoms diffuse in three-dimensional space according to an Ornstein-Uhlenbeck process. The model allows for simultaneous estimation of evolutionary distances, indel rates, structural drift rates, and alignments, while fully accounting for uncertainty. The inclusion of structural information enables pairwise evolutionary distance estimation on time scales not previously attainable with sequence evolution models. Ideally, inference should not be performed in a pairwise fashion between proteins, but in a fully Bayesian setting that simultaneously estimates the phylogenetic tree, alignment, and model parameters. I extend the initial pairwise model to this framework and explore model variants which improve agreement between sequence and structure information. The model also allows for estimation of heterogeneous rates of structural evolution throughout the tree, identifying groups of proteins that are evolving structurally at different speeds.
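As a rough illustration of the Ornstein-Uhlenbeck component mentioned above, the following sketch computes the closed-form transition moments of a one-dimensional OU process, which is how a single backbone coordinate could be modelled along a branch of length t. The parameterisation (dX = -theta (X - mu) dt + sigma dW) and the names `ou_transition`, `stationary_correlation`, `theta`, `sigma`, and `mu` are assumptions made for this example; they are not taken from the thesis, and the actual model may be parameterised differently.

```python
import math


def ou_transition(x0, t, theta, sigma, mu=0.0):
    """Mean and variance of X_t given X_0 = x0 for the (assumed) OU process
    dX = -theta * (X - mu) dt + sigma dW, with theta > 0 and t >= 0."""
    decay = math.exp(-theta * t)
    mean = mu + (x0 - mu) * decay
    var = sigma ** 2 / (2.0 * theta) * (1.0 - decay ** 2)
    return mean, var


def stationary_correlation(t, theta):
    """Correlation between the coordinate at the two ends of a branch of
    length t when the process starts from its stationary distribution."""
    return math.exp(-theta * t)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the behaviour.
    for t in (0.1, 1.0, 5.0):
        mean, var = ou_transition(x0=2.0, t=t, theta=0.5, sigma=1.0)
        print(f"t={t}: mean={mean:.4f}, var={var:.4f}, "
              f"corr={stationary_correlation(t, 0.5):.4f}")
```

Under these assumptions the variance saturates at sigma^2 / (2 theta) rather than growing without bound, as it would under plain Brownian motion. The correlation between related coordinates decays exponentially in t but stays positive, so a signal persists over long branches.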
To explore the posterior over topologies by Markov chain Monte Carlo sampling, I also introduce novel joint topology and alignment proposals which greatly improve mixing of the underlying Markov chain. I show that the inclusion of structural information reduces both alignment and topology uncertainty. The software is available as a plugin to the package StatAlign. Finally, I also examine limits on statistical inference of phylogeny through sequence information models. These limits arise due to the "cutoff phenomenon," a term from probability theory describing processes that remain far from their equilibrium distribution for some period of time before swiftly transitioning to stationarity. Evolutionary sequence models all exhibit a cutoff; I show how to find the cutoff for specific models and sequences and relate the cutoff explicitly to increased uncertainty in inference of evolutionary distances. I give theoretical results for symmetric models, and demonstrate with simulations that these results apply to more realistic and widespread models as well. This analysis also highlights several drawbacks to common default priors for phylogenetic analysis, and I suggest a more useful class of priors.
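The cutoff phenomenon described above can be made concrete with a small numerical sketch. It assumes the Jukes-Cantor model (a symmetric substitution model chosen here for illustration, not necessarily the one analysed in the thesis) acting independently on n sites, with t measured in expected substitutions per site. Under these assumptions the total variation distance between the evolved sequence distribution and stationarity depends only on the number of sites that still match the ancestor, so it reduces to a distance between two binomial distributions. The function names are hypothetical.

```python
import math


def binom_logpmf(n, k, p):
    """Log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))


def jc_total_variation(n_sites, t):
    """Total variation distance from stationarity after time t > 0 for
    n_sites independent Jukes-Cantor sites started from a fixed sequence."""
    # Probability that a site still shows the ancestral nucleotide.
    p_match = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    tv = 0.0
    for k in range(n_sites + 1):
        evolved = math.exp(binom_logpmf(n_sites, k, p_match))
        stationary = math.exp(binom_logpmf(n_sites, k, 0.25))
        tv += abs(evolved - stationary)
    return 0.5 * tv


if __name__ == "__main__":
    n = 1000
    # Heuristic cutoff location for this model: (3/8) * ln(n).
    t_cut = 0.375 * math.log(n)
    for scale in (0.5, 0.75, 1.0, 1.25, 1.5, 2.0):
        t = scale * t_cut
        print(f"t = {t:.2f}  TV = {jc_total_variation(n, t):.4f}")
```

In this setting the distance stays close to 1 for a long stretch and then falls toward 0 in a narrow window around (3/8) ln n. Beyond that window the sequences carry very little information about t, which is the sense in which the cutoff limits inference of evolutionary distances.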