Seasonal influenza viruses are constantly changing and produce a different set of circulating strains each season. Small genetic changes can accumulate over time and result in antigenically different viruses; this may prevent the body’s immune system from recognizing those viruses. Because of rapid mutation, particularly in the haemagglutinin (HA) gene, seasonal influenza vaccines must be updated frequently. This requires choosing strains to include in the updates to maximize the vaccines’ benefits, according to estimates of which strains will be circulating in upcoming seasons. This is a challenging prediction task. In this paper, we use longitudinally sampled phylogenetic trees based on HA sequences from human influenza viruses, together with counts of epitope site polymorphisms in HA, to predict which influenza virus strains are likely to be successful. We extract small groups of taxa (subtrees) and use a suite of features of these subtrees as key inputs to machine learning tools. Using a range of training and testing strategies, including training on H3N2 and testing on H1N1, we find that successful prediction of future expansion of small subtrees is possible from these data, with accuracies of 0.71–0.85 and a classifier ‘area under the curve’ (AUC) of 0.75–0.9.
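The abstract reports two metrics, accuracy and area under the curve. As a minimal sketch of how they are typically computed for a binary "subtree grows / does not grow" classifier, the example below uses made-up labels and scores; the data, the 0.5 threshold, and the function names are assumptions for illustration, not the paper's pipeline.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of cases where the thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive case scores higher than a random negative case (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = subtree expanded in the following season, 0 = it did not.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]  # hypothetical classifier outputs

print("accuracy:", accuracy(labels, scores))
print("AUC:", auc(labels, scores))
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two are often reported together.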
The shape of phylogenetic trees can be used to gain evolutionary insights. A tree’s shape specifies the connectivity of a tree, while its branch lengths reflect either the time or genetic distance between branching events; well-known measures of tree shape include the Colless and Sackin imbalance, which describe the asymmetry of a tree. In other contexts, network science has become an important paradigm for describing structural features of networks and using them to understand complex systems, ranging from protein interactions to social systems. Network science is thus a potential source of many novel ways to characterize tree shape, as trees are also networks. Here, we tailor tools from network science, including diameter, average path length, and betweenness, closeness, and eigenvector centrality, to summarize phylogenetic tree shapes. We thereby propose tree shape summaries that are complementary to both asymmetry and the frequencies of small configurations. These new statistics can be computed in linear time and scale well to describe the shapes of large trees. We apply these statistics, alongside some conventional tree statistics, to phylogenetic trees from three very different viruses (HIV, dengue fever and measles), from the same virus in different epidemiological scenarios (influenza A and HIV) and from simulation models known to produce trees with different shapes. Using mutual information and supervised learning algorithms, we find that the statistics adapted from network science perform as well as or better than conventional statistics. We describe their distributions and prove some basic results about their extreme values in a tree. We conclude that network science-based tree shape summaries are a promising addition to the toolkit of tree shape features. All our shape summaries, as well as functions to select the most discriminating ones for two sets of trees, are freely available as an R package at http://github.com/Leonardini/treeCentrality.
Phylogenetic trees are frequently used in biology to study the relationships between a number of species or organisms. The shape of a phylogenetic tree contains useful information about patterns of speciation and extinction, so powerful tools are needed to investigate the shape of a phylogenetic tree. Tree shape statistics are a common approach to quantifying the shape of a phylogenetic tree by encoding it with a single number. In this article, we propose a new resolution function to evaluate the power of different tree shape statistics to distinguish between dissimilar trees. We show that the new resolution function requires less time and space in comparison with the previously proposed resolution function for tree shape statistics. We also introduce a new class of tree shape statistics, which are linear combinations of two existing statistics that are optimal with respect to a resolution function, and show evidence that the statistics in this class converge to a limiting linear combination as the size of the tree increases. Our implementation is freely available at https://github.com/WGS-TB/TreeShapeStats.
2 Introduction
Molecular data describing the evolution, variation and diversity of organisms over time is more widely available than ever before due to rapid improvements in sequencing technology. Using these data to infer the underlying evolutionary process is a key ongoing challenge in many areas of biology. In particular, in infectious disease, it is crucial to understand pathogen adaptation: despite improvements in sanitation and vaccination and the development of antibiotics, infectious pathogens continue to emerge from zoonotic infections and to adapt to human immune responses, vaccines, and antimicrobials. Next-generation sequencing has afforded unprecedented opportunities to generate pathogen genome sequences in a highly scalable manner, and theoretical tools have been developed to interrogate these data, largely through reconstructed phylogenetic trees. There has been considerable interest over the years in comparing the shapes of phylogene...