Abstract: Multiple sequence alignments (MSAs) are used for structural [1,2] and evolutionary predictions [1,2], but the complexity of aligning large datasets requires the use of approximate solutions [3], including the progressive algorithm [4]. Progressive MSA methods start by aligning the most similar sequences and subsequently incorporate the remaining sequences, from leaf-to-root, based on a guide-tree. Their accuracy declines substantially as the number of sequences is scaled up [5]. We introduce a regressive algorithm that enabl…
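The leaf-to-root order described in the abstract can be illustrated with a small sketch. It only computes the order in which groups of sequences would be merged along a guide tree; it performs no alignment, and it is not the T-Coffee implementation. The nested-tuple tree encoding, the function name `progressive_merge_order`, and the sequence names are assumptions made for illustration.

```python
from typing import List, Tuple, Union

# Hypothetical guide-tree encoding: a leaf is a sequence name,
# an internal node is a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def progressive_merge_order(tree: Tree) -> List[Tuple[List[str], List[str]]]:
    """Return the merge steps of a progressive aligner in leaf-to-root order.

    Each step is a pair (left_group, right_group) of sequence names whose
    (sub)alignments would be aligned to each other at that node.
    """
    steps: List[Tuple[List[str], List[str]]] = []

    def visit(node: Tree) -> List[str]:
        if isinstance(node, str):
            return [node]  # a leaf contributes a single sequence
        left, right = node
        left_leaves = visit(left)
        right_leaves = visit(right)
        # Children are fully processed before their parent (post-order),
        # so the most closely related sequences are merged first.
        steps.append((left_leaves, right_leaves))
        return left_leaves + right_leaves

    visit(tree)
    return steps


guide_tree: Tree = (("seqA", "seqB"), ("seqC", ("seqD", "seqE")))
for left_group, right_group in progressive_merge_order(guide_tree):
    print(left_group, "+", right_group)
```

In a real progressive aligner each step would be a sequence-to-sequence or profile-to-profile alignment, which is where errors can accumulate as the number of sequences grows.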
“…(A) The edge leading to hummingbirds exhibits the largest number of changes to mitochondria-encoded proteins when considering all internal edges of a bird phylogenetic tree. This maximum likelihood tree was generated from an alignment of concatenated mitochondrial proteins from birds and Bos taurus using T-coffee in regressive mode ( Garriga et al 2019 ), followed by ancestral prediction using PAGAN ( Löytynoja et al 2012 ). Amino acid substitutions between each pair of ancestral and descendant nodes internal to the bird tree (node-to-node) were determined, summed across all positions, and plotted.…”
Section: Results (mentioning)
confidence: 99%
“…Alignments were performed by use of standalone MAFFT (version 7.407) (Katoh and Standley 2013) or by T-coffee (version 13.40.5) in regressive mode (Garriga et al 2019). For initial alignments of insect COI barcodes, MAFFT alignment was performed using an online server (Kuraku et al 2013;Katoh 2017), and translations of barcodes using the appropriate codon tables were performed using AliView (Larsson 2014).…”
Hummingbirds in flight exhibit the highest mass-specific metabolic rate of all vertebrates. The bioenergetic requirements associated with sustained hovering flight raise the possibility of unique amino acid substitutions that would enhance aerobic metabolism. Here, we have identified a non-conservative substitution within the mitochondria-encoded cytochrome c oxidase subunit I (COI) that is fixed within hummingbirds, but not among other vertebrates. This unusual change is also rare among metazoans, but can be identified in several clades with diverse life histories. We performed atomistic molecular dynamics simulations using bovine and hummingbird COI models, thereby bypassing experimental limitations imposed by the inability to modify mtDNA in a site-specific manner. Intriguingly, our findings suggest that COI amino acid position 153 (bovine numbering convention) provides control over the hydration and activity of a key proton channel in COX. We discuss potential phenotypic outcomes linked to this alteration encoded by hummingbird mitochondrial genomes.
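The node-to-node counting described in the first citation statement above (substitutions between each ancestral and descendant node, summed across all positions) can be sketched as follows. This is a minimal illustration, not the citing authors' pipeline: the function names, the toy sequences, and the choice to skip positions containing a gap or an ambiguous residue are all assumptions, since the source does not say how such positions were handled.

```python
from typing import Dict, List, Tuple

# Assumption: positions with a gap or ambiguous residue on either side
# are not counted as substitutions.
SKIPPED_CHARS = {"-", "X", "?"}


def count_substitutions(ancestor: str, descendant: str) -> int:
    """Count differing positions between two aligned sequences."""
    if len(ancestor) != len(descendant):
        raise ValueError("aligned sequences must have equal length")
    return sum(
        1
        for a, d in zip(ancestor, descendant)
        if a != d and a not in SKIPPED_CHARS and d not in SKIPPED_CHARS
    )


def substitutions_per_edge(
    sequences: Dict[str, str], edges: List[Tuple[str, str]]
) -> Dict[Tuple[str, str], int]:
    """Sum substitutions across all positions for each (parent, child) edge."""
    return {
        (parent, child): count_substitutions(sequences[parent], sequences[child])
        for parent, child in edges
    }


# Toy aligned sequences for hypothetical internal nodes of a tree.
toy_sequences = {
    "root": "MFINRW",
    "n1": "MFIDRW",
    "n2": "MFIDKW",
    "n3": "MF-NRW",
}
toy_edges = [("root", "n1"), ("n1", "n2"), ("root", "n3")]
print(substitutions_per_edge(toy_sequences, toy_edges))
```

The edge with the largest count would then correspond to the kind of outlier the statement reports for the edge leading to hummingbirds.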
“…VCF file) describing their relationship. Algorithms for calculating genome-scale multiple alignments are resource intensive 34,35 and yield a more complex structure compared to a pairwise alignment. Reference flow's use of pairwise alignments also helps to solve an "N+1" problem; adding one additional reference to the second pass requires only that we index the new genome and obtain an additional whole-genome alignment (or otherwise infer such an alignment, e.g.…”
Most sequencing data analyses start by aligning sequencing reads to a linear reference genome. But failure to account for genetic variation causes reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the "reference flow" alignment method that uses information from multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow exhibits a similar level of accuracy and bias avoidance, but with 13% of the memory footprint and 6 times the speed.
“…High throughput sequencing technologies resulted in an unprecedented surge of genomic data. This data explosion challenges many existing analysis pipelines that rely on global multiple sequence alignments (MSA) (Nishimura et al, 2016;Garriga et al, 2019). Although alignment-free methods sequence comparisons exist (Ren et al, 2018), plenty of software still require a global alignment of all sequences (i.e.…”
Section: Introduction (mentioning)
confidence: 99%
“…In fact, it has been shown that alignment quality decays with an increasing number of sequences (Sievers et al, 2011). Newer software such as PASTA (Mirarab et al, 2015) and the regressive alignment algorithm (RAA) (Garriga et al, 2019) leverage traditional MSA software capabilities to create alignments for hundreds of thousands to millions of sequences. However, these strategies also suffer from the same weaknesses ( Figure S2).…”
Amplicons to Global Gene (A2G²) is a Python wrapper that uses MAFFT and an "Amplicon to Gene" strategy to align very large numbers of sequences while improving alignment accuracy. It is specially developed to deal with conserved genes, where traditional aligners introduce a significant amount of gaps. A2G² leverages the add sequences option of MAFFT to align the sequences to a global reference gene and a local reference region. Both of these references can be consensus sequences of trusted sources. Efficient parallelization of these tasks allows A2G² to align a very large number of sequences (> 500K) in a reasonable amount of time. A2G² can be imported in Python for easier integration with other software, or can be run via command line. Availability: A2G² is implemented in Python 3 (3.6) and depends on MAFFT availability. Other package requirements can be found in the requirements.txt file at https://github.com/jshleap/A2G. A2G² is also available via PyPI (https://pypi.org/project/A2G). It is licensed under the LGPLv3. Supplementary information: Supplementary material is available on GitHub as a Jupyter notebook.
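For readers unfamiliar with MAFFT's add-sequences mode, the sketch below shows one way such a call could be assembled from Python. It is not A2G²'s code and does not reflect the options A2G² actually passes. The helper names and file paths are hypothetical, and the choice of `--keeplength` (which keeps the reference alignment length unchanged) and `--thread` is an assumption for illustration.

```python
import subprocess
from typing import List


def build_mafft_add_command(
    new_sequences: str, reference_alignment: str, threads: int = 1
) -> List[str]:
    """Assemble a MAFFT call that adds sequences to an existing alignment.

    MAFFT writes the resulting alignment to standard output.
    """
    return [
        "mafft",
        "--thread", str(threads),
        "--add", new_sequences,   # unaligned sequences to be added
        "--keeplength",           # assumption: keep the reference length fixed
        reference_alignment,      # existing alignment or single reference
    ]


def run_mafft_add(
    new_sequences: str, reference_alignment: str, output_path: str, threads: int = 1
) -> None:
    """Run the command and store MAFFT's standard output (requires MAFFT on PATH)."""
    command = build_mafft_add_command(new_sequences, reference_alignment, threads)
    with open(output_path, "w") as handle:
        subprocess.run(command, stdout=handle, check=True)


# Hypothetical file names, shown without executing MAFFT.
print(" ".join(build_mafft_add_command("amplicons.fasta", "reference_gene.fasta", threads=4)))
```

Because each batch of new sequences is aligned against the same fixed reference, batches are independent of one another, which is one reason this kind of task parallelizes well.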