2020
DOI: 10.21203/rs.3.rs-48247/v1
Preprint

Harnessing machine learning to boost heuristic strategies for phylogenetic-tree search

Abstract: Inferring a phylogenetic tree, which describes the evolutionary relationships among a set of organisms, genes, or genomes, is a fundamental step in numerous evolutionary studies. With the aim of making tree inference feasible for problems involving more than a handful of sequences, current algorithms for phylogenetic tree reconstruction utilize various heuristic approaches. Such approaches rely on performing costly likelihood optimizations, and thus evaluate only a subset of all potential trees. Consequently, …

Cited by 5 publications (8 citation statements)
References 30 publications
“…This is because the features learned from the model may not correlate with existing literature that addresses factors influencing phylogenetic uncertainty. In contrast, our study's Feature Importance analyses demonstrate clear connections with previous studies, reinforcing the notion that well-defined features can inform decision-making in heuristic search moves within the tree space (Azouri et al, 2021, 2023) and predict the difficulty of a dataset for phylogenetic reconstruction (Haag et al, 2022). However, our study also suggests, particularly after analyzing the Carangaria dataset, that current machine learning models in phylogenetics that use empirical datasets should be trained with data that is agnostic to phylogenetic model assumptions such as homogeneity, stationarity, and reversibility (Pupko and Mayrose, 2020).…”
Section: Machine Learning and Feature Importance
Citation type: supporting
confidence: 83%
“…The obtained insights can greatly enhance decision-making in phylogenetic analyses, aiding in the selection of appropriate DNA sequence models and data transformation methods. While machine learning has demonstrated its utility in various aspects of phylogenetic analysis, such as biogeography models (Smith et al, 2017; Fonseca et al, 2021), phylodynamic parameter estimation (Voznica et al, 2022), ancestral state reconstruction (Theobald et al, 2022), and improving heuristics for phylogenetic inference (Azouri et al, 2021, 2023), the quality of these models ultimately depends on the quality of the training data, which can inherently carry its own uncertainty, as we showed in this study. Therefore, the pursuit of identifying factors that influence the uncertainty in phylogenomic datasets and developing models that effectively capture non-linear relationships among features in phylogenetic inference is essential for advancing our understanding of the tree of life.…”
Section: Discussion
Citation type: mentioning
confidence: 83%
“…The code that supports the findings of this study was written in Python version 3.6 and has been deposited in Open Source Framework (OSF) with the identifier DOI 10.17605/OSF.IO/B8AQJ [51]. Computation of likelihoods and parameter estimates were executed using the following application versions: PhyML 3.0 [31], RAxML-NG 0.9.0 [48].…”
Section: Data Availability
Citation type: mentioning
confidence: 88%
“…The datasets contained within the empirical set have been deposited in Open Source Framework (OSF) with the identifier DOI 10.17605/OSF.IO/B8AQJ [51]. These datasets were assembled from the following databases: TreeBase (https://treebase.org/treebaseweb/urlAPI.html); Selectome (https://selectome.org/); protDB (https://protdb.org/); PloiDB (https://doi.org/10.3732/ajb.1500424); PANDIT (https://www.ebi.ac.uk/research/goldman/software/pandit); OrthoMaM (https://orthomam.mbb.cnrs.fr/).…”
Section: Data Availability
Citation type: mentioning
confidence: 99%