Harnessing machine learning to boost heuristic strategies for phylogenetic-tree search

Azouri, Dana; Abadi, Shiran; Mansour, Yishay; Mayrose, Itay; Pupko, Tal

doi:10.21203/rs.3.rs-48247/v1

Cited by 5 publications

(8 citation statements)

References 30 publications


“…This is because the features learned from the model may not correlate with existing literature that addresses factors influencing phylogenetic uncertainty. In contrast, our study's Feature Importance analyses demonstrate clear connections with previous studies, reinforcing the notion that well-defined features can inform decision-making in heuristic search moves within the tree space (Azouri et al, 2021, 2023) and predict the difficulty of a dataset for phylogenetic reconstruction (Haag et al, 2022). However, our study also suggests, particularly after analyzing the Carangaria dataset, that current machine learning models in phylogenetics that use empirical datasets should be trained with data that is agnostic to phylogenetic model assumptions such as homogeneity, stationarity, and reversibility (Pupko and Mayrose, 2020).…”

Section: Machine Learning and Feature Importance (supporting)

confidence: 83%

“…The obtained insights can greatly enhance decision-making in phylogenetic analyses, aiding in the selection of appropriate DNA sequence models and data transformation methods. While machine learning has demonstrated its utility in various aspects of phylogenetic analysis, such as biogeography models (Smith et al, 2017; Fonseca et al, 2021), phylodynamic parameter estimation (Voznica et al, 2022), ancestral state reconstruction (Theobald et al, 2022), and improving heuristics for phylogenetic inference (Azouri et al, 2021, 2023), the quality of these models ultimately depends on the quality of the training data, which can inherently carry its own uncertainty, as we showed in this study. Therefore, the pursuit of identifying factors that influence the uncertainty in phylogenomic datasets and developing models that effectively capture non-linear relationships among features in phylogenetic inference is essential for advancing our understanding of the tree of life.…”

Section: Discussion (mentioning)

confidence: 83%

“…This presents an unprecedented opportunity to train machine learning models that can potentially learn complex non-linear relationships within the phylogenomic dataset without imposing a fixed set of probability distribution families to construct the model (i.e., non-parametric models) (Hastie et al 2009; Murphy 2012; Bokulich et al 2018). Machine learning models, capable of identifying complex non-linear relationships, have recently found applications in various phylogenetic analyses, including fast phylogenetic inference (Azouri et al 2021, 2023), phylogenetic placement (Jiang, Tabaghi, and Mirarab 2022), learning phylodynamic parameters (Voznica 2021), and selecting biogeographical models from simulated data (Smith et al 2017; Burbrink and Gehara 2018). While most of these studies focus on prediction tasks, exploring the interactions of features in specific phylogenetic datasets leveraging these non-linear models remains largely unexplored.…”

Section: Introduction (mentioning)

confidence: 99%


Dissecting Factors Underlying Phylogenetic Uncertainty Using Machine Learning Models

Rosas-Puchuri, Duarte-Ribeiro, Khanmohammadi et al. 2023

Preprint


Phylogenetic inference can be influenced by both underlying biological processes and methodological factors. While biological processes can be modeled, these models frequently make the assumption that methodological factors do not significantly influence the outcome of phylogenomic analyses. Depending on their severity, methodological factors can introduce inconsistency and uncertainty into the inference process. Although search protocols have been proposed to mitigate these issues, many solutions tend to treat factors independently or assume a linear relationship among them. In this study, we capitalize on the increasing size of phylogenetic datasets, using them to train machine learning models. This approach transcends the linearity assumption, accommodating complex non-linear relationships among features. We examined two phylogenomic datasets for teleost fishes: a newly generated dataset for protacanthopterygians (salmonids, galaxiids, marine smelts, and allies), and a reanalysis of a dataset for carangarians (flatfishes and allies). Upon testing five supervised machine learning models, we found that all outperformed the linear model (p < 0.05), with the deep neural network showing the best fit for both empirical datasets tested. Feature importance analyses indicated that influential factors were specific to individual datasets. The insights obtained have the potential to significantly enhance decision-making in phylogenetic analyses, assisting, for example, in the choice of suitable DNA sequence models and data transformation methods. This study can serve as a baseline for future endeavors aiming to capture non-linear interactions of features in phylogenomic datasets using machine learning and complement existing tools for phylogenetic analyses.


“…The code that supports the findings of this study was written in Python version 3.6 and has been deposited in Open Source Framework (OSF) with the identifier DOI 10.17605/OSF.IO/B8AQJ 51. Computation of likelihoods and parameter estimates were executed using the following application versions: PhyML 3.0 31, RAxML-NG 0.9.0 48.…”

Section: Data Availability (mentioning)

confidence: 88%

“…The datasets contained within the empirical set have been deposited in Open Source Framework (OSF) with the identifier DOI 10.17605/OSF.IO/B8AQJ 51. These datasets were assembled from the following databases: TreeBase (https://treebase.org/treebaseweb/urlAPI.html); Selectome (https://selectome.org/); protDB (https://protdb.org/); PloiDB (https://doi.org/10.3732/ajb.1500424); PANDIT (https://www.ebi.ac.uk/research/goldman/software/pandit); OrthoMaM (https://orthomam.mbb.cnrs.fr/).…”

Section: Data Availability (mentioning)

confidence: 99%

Harnessing machine learning to guide phylogenetic-tree search algorithms

et al. 2021

Self Cite


Inferring a phylogenetic tree is a fundamental challenge in evolutionary studies. Current paradigms for phylogenetic tree reconstruction rely on performing costly likelihood optimizations. With the aim of making tree inference feasible for problems involving more than a handful of sequences, inference under the maximum-likelihood paradigm integrates heuristic approaches to evaluate only a subset of all potential trees. Consequently, existing methods suffer from the known tradeoff between accuracy and running time. In this proof-of-concept study, we train a machine-learning algorithm over an extensive cohort of empirical data to predict the neighboring trees that increase the likelihood, without actually computing their likelihood. This provides means to safely discard a large set of the search space, thus potentially accelerating heuristic tree searches without losing accuracy. Our analyses suggest that machine learning can guide tree-search methodologies towards the most promising candidate trees.
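The filtering idea in this abstract can be illustrated with a minimal sketch: rank the neighboring trees with a cheap learned predictor, then run the costly likelihood computation only on a short list. This is not the authors' implementation. The function names, the `keep_fraction` parameter, and the toy stand-ins for the predictor and the likelihood are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")  # a candidate neighboring tree, in any representation


def select_promising(
    candidates: Sequence[T],
    predict_gain: Callable[[T], float],
    keep_fraction: float = 0.1,
) -> List[T]:
    """Keep the candidates ranked highest by a cheap predictor (hypothetical)."""
    ranked = sorted(candidates, key=predict_gain, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def best_neighbor(
    candidates: Sequence[T],
    predict_gain: Callable[[T], float],
    log_likelihood: Callable[[T], float],
    keep_fraction: float = 0.1,
) -> T:
    """Evaluate the expensive likelihood on the short list only."""
    if not candidates:
        raise ValueError("no candidate trees to evaluate")
    shortlist = select_promising(candidates, predict_gain, keep_fraction)
    return max(shortlist, key=log_likelihood)


# Toy stand-ins: integers play the role of trees, and both scoring
# functions are made up purely to exercise the control flow.
likelihood_calls: List[int] = []


def toy_log_likelihood(tree: int) -> float:
    likelihood_calls.append(tree)  # record each expensive evaluation
    return -float((tree - 8) ** 2)


def toy_predict_gain(tree: int) -> float:
    return -float(abs(tree - 7))  # imperfect but correlated predictor


chosen = best_neighbor(
    list(range(20)), toy_predict_gain, toy_log_likelihood, keep_fraction=0.25
)
```

In this toy run only 5 of the 20 candidates reach the likelihood function, which is the source of the speed-up the abstract describes. Whether accuracy is preserved depends on how well the predictor ranks the truly best neighbors.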


Differentiable Search of Evolutionary Trees

Hettiarachchi, Swartz, Овчинников 2023

Preprint


Inferring the most probable evolutionary tree given leaf nodes is an important problem in computational biology that reveals the evolutionary relationships between species. Due to the exponential growth of possible tree topologies, finding the best tree in polynomial time becomes computationally infeasible. In this work, we propose a novel differentiable approach as an alternative to traditional heuristic-based combinatorial tree search methods in phylogeny. The optimization objective of interest in this work is to find the most parsimonious tree (i.e., to minimize the total number of evolutionary changes in the tree). We empirically evaluate our method using randomly generated trees of up to 128 leaves, with each node represented by a 256-length protein sequence. Our method exhibits promising convergence (< 1% error for trees up to 32 leaves, < 8% error up to 128 leaves, given only leaf node information), illustrating its potential in much broader phylogenetic inference problems and possible integration with end-to-end differentiable models. The code to reproduce the experiments in this paper can be found at https://github.ramith.io/diff-evol-tree-search.
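For readers unfamiliar with the parsimony objective mentioned in this abstract, the sketch below scores a fixed tree with the classical Fitch algorithm, which counts the minimum number of state changes needed to explain the leaf sequences. It is not the differentiable method proposed in the paper. The nested-tuple tree encoding and the four-taxon alignment are assumptions chosen for illustration.

```python
from typing import Dict, Set, Tuple, Union

# A rooted binary tree: a leaf name, or a (left, right) pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_site(tree: Tree, states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Return (candidate ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change here.
        return common, left_cost + right_cost
    # The children disagree: one change is needed on a child branch.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree: Tree, alignment: Dict[str, str]) -> int:
    """Sum the per-site Fitch costs over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Hypothetical four-taxon alignment with three sites.
alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT"}
tree_1: Tree = (("A", "B"), ("C", "D"))
tree_2: Tree = (("A", "C"), ("B", "D"))

score_1 = parsimony_score(tree_1, alignment)  # 2 changes
score_2 = parsimony_score(tree_2, alignment)  # 4 changes
```

Here `tree_1` groups identical sequences together and needs fewer changes, so it is the more parsimonious of the two topologies. Searching over all topologies for the minimum score is the hard combinatorial part that the abstract's method targets.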
