Motivation
FastTree-2 is one of the most successful tools for inferring large phylogenies. Although speed is at the core of its design, the FastTree-2 implementation still has important issues that harm its performance and scalability. To address these limitations, we introduce VeryFastTree, a highly tuned implementation of FastTree-2 that uses parallelization and vectorization strategies to boost performance.
Results
Using double-precision arithmetic on a standard server, VeryFastTree constructs a tree from an ultra-large 330k alignment in only 4.5 hours, which is 7.8× and 3.5× faster than the sequential and best parallel FastTree-2 times, respectively.
Availability
VeryFastTree is available at the GitHub repository: https://github.com/citiususc/veryfasttree
Supplementary information
Supplementary data are available at Bioinformatics online.
The k-nearest-neighbors (kNN) graph is a popular and powerful data structure that is used in various areas of Data Science, but the high computational cost of obtaining it hinders its use on large datasets. Approximate solutions have been described in the literature using diverse techniques, among which Locality-sensitive Hashing (LSH) is a promising alternative that still has unsolved problems. We present Variable Resolution Locality-sensitive Hashing, an algorithm that addresses these problems to obtain an approximate kNN graph at a significantly reduced computational cost. Its usability is greatly enhanced by its capacity to automatically find adequate hyperparameter values, a common hindrance to LSH-based methods. Moreover, we provide an implementation in the distributed computing framework Apache Spark that takes advantage of the structure of the algorithm to efficiently distribute the computational load across multiple machines, enabling practitioners to apply this solution to very large datasets. Experimental results show that our method offers significant improvements over the state-of-the-art in the field and shows very good scalability as more machines are added to the computation.
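To make the general idea concrete, the following is a minimal sketch of how plain random-projection LSH can produce an approximate kNN graph. It is not the Variable Resolution LSH algorithm described above, whose details the abstract does not give, and it runs on a single machine rather than on Apache Spark. The function name `lsh_knn_graph` and the parameters `num_tables`, `num_projections`, and `bucket_width` are illustrative assumptions. Note that `bucket_width` has to be chosen by hand here, which is presumably the kind of hyperparameter tuning the abstract calls a hindrance.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def lsh_knn_graph(points, k, num_tables=8, num_projections=4,
                  bucket_width=1.0, seed=0):
    """Approximate kNN graph using generic random-projection LSH.

    Illustrative sketch only: points that share a hash bucket in at
    least one table become candidate neighbours, and exact distances
    are computed only between candidates.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # candidates[i] maps a candidate neighbour j to its squared distance.
    candidates = [dict() for _ in points]

    for _ in range(num_tables):
        # Each table draws its own random directions and offsets.
        directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                      for _ in range(num_projections)]
        offsets = [rng.uniform(0.0, bucket_width)
                   for _ in range(num_projections)]

        # Hash every point to a bucket key (one integer per projection).
        buckets = defaultdict(list)
        for idx, p in enumerate(points):
            key = tuple(
                math.floor((sum(a * x for a, x in zip(d, p)) + b) / bucket_width)
                for d, b in zip(directions, offsets)
            )
            buckets[key].append(idx)

        # Compare only points that landed in the same bucket.
        for members in buckets.values():
            for i, j in combinations(members, 2):
                d2 = sum((x - y) ** 2 for x, y in zip(points[i], points[j]))
                candidates[i][j] = d2
                candidates[j][i] = d2

    # Keep the k closest candidates found for each point.
    return [sorted(c, key=c.get)[:k] for c in candidates]
```

A small `bucket_width` yields few candidates per point (fast, but neighbours may be missed), while a large one approaches the exact all-pairs computation. This trade-off is what makes the choice of hyperparameters matter in LSH-based methods.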