Bayesian phylogenetic inference is an important alternative to maximum likelihood-based phylogenetic methods. However, inferring large trees using the Bayesian approach is computationally demanding, requiring huge amounts of memory and months of computational time. With a combination of novel parallel algorithms and the latest system technology, terascale phylogenetic tools will provide biologists with the computational power necessary to conduct experiments on very large datasets, and thus aid construction of the tree of life.

In this work we evaluate the performance of PBPI, a parallel application that reconstructs phylogenetic trees using MCMC-based Bayesian methods, on two terascale systems: Blue Gene/L at IBM Rochester and System X at Virginia Tech. Our results confirm that for a benchmark dataset with 218 taxa and 10000 characters, PBPI can achieve linear speedup on 1024 or more processors for both systems.
Introduction

A phylogeny, a tree or network-like structure representing the evolutionary relationships among a group of species, serves as an important framework to organize, compare, and analyze biological data. Besides its primary role in understanding biological evolution and diversity, it has also been widely used in many other areas, including genetics, genomics, drug discovery, plant improvement, and disease control. The importance of phylogeny to science and society is best demonstrated by the NSF ATOL project [1], whose goal is to provide an overall framework for retrieving, comparing, and predicting huge amounts of biological data by "assembling a tree of life for 1.7 million described species on the earth".

The fundamental task of most phylogenetic inference is to estimate the "correct" phylogenetic trees given one or more datasets that encode clues to the evolutionary path. Among the various phylogenetic inference approaches, the Bayesian approach distinguishes itself in several respects. First, it uses explicit models of evolution and likelihood functions, as does maximum likelihood estimation, another important statistical phylogenetic method. The Bayesian approach therefore has the potential to incorporate complicated models and existing knowledge into the process of phylogenetic inference. Second, it takes a probabilistic view of the estimated trees and ranks these trees by a quantity called the posterior probability. Bayesian phylogenetic inference thus avoids the problem, present in many NP-hard optimality-based methods, of outputting only one "best" tree.

Building large phylogenetic trees using the Bayesian approach is computationally demanding. For example, building a phylogenetic tree with hundreds of taxa and thousands of characters may require several gigabytes of memory and several months of computing time.
To make Bayesian phylogenetic inference more efficient and more practical for large phylogenetic problems, it is necessary to run phylogenetic tools on terascale systems.

The main contributions of this paper are twofold. First, we report the excellent scaling results of PBPI, a parallel Bayesian phylo...