Accurate and fast aligners are required to handle the steadily increasing volume of sequencing data. Here we present an approach, based on a universal three-stage scheme, that allows performant alignment of short reads (Illumina) as well as long reads (Pacific Biosciences, ultralong Oxford Nanopore) while achieving high accuracy. It is also suitable for the discovery of insertions and deletions that originate from structural variants. We comprehensively compare our approach to other state-of-the-art aligners in order to confirm its performance with respect to accuracy and runtime. As part of our algorithmic scheme, we introduce two line sweep-based techniques called “strip of consideration” and “seed harmonization”. These techniques represent a replacement for chaining and do not rely on any specially tailored data structures. Additionally, we propose a refined form of seeding on the foundation of the FMD-index.
Carbon (C), hydrogen (H), nitrogen (N), oxygen (O), and sulfur (S) atoms are of interest because they are the foundation for amino acid (AA) composition and for the folding and functions of proteins, and thus define and control the survival of a cell, the smallest unit of life. Here, we calculated the proteomic atom distribution in >1500 randomly selected species across the entire current phylogenetic tree and identified uracil-5-methyltransferase (U5MTase) of the protozoan parasite Plasmodium falciparum (Pf, strain Pf3D7) as having a distinct atom and AA distribution pattern. We determined its apicoplast location and in silico 3D protein structure to refocus attention exclusively on U5MTase, which has tremendous potential for therapeutic intervention in malaria. Around 300 million clinical cases of malaria occur each year in tropical and subtropical regions of the world, resulting in over one million deaths annually, placing malaria among the most serious infectious diseases. Genomic and proteomic research on the clades of parasites containing Pf is progressing slowly, and the functions of most of the ~5300 genes are still unknown. We applied a 'bottom-up' comparative proteomic atomics analysis across the phylogenetic tree to visualize a protein molecule on its actual basis, i.e., at its atomic level. We identified a protruding Pf3D7-specific U5MTase, determined its 3D protein structure, and identified potential inhibitory drug molecules through in silico drug screening that might serve as possible remedies for the treatment of malaria. In addition, this atomic-based proteome map provides a unique approach for the identification of parasite-specific proteins that could be considered as novel therapeutic targets.
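The abstract above does not spell out how the atom distribution was computed. As a rough illustration only, the following sketch counts C, H, N, O and S atoms for a single protein sequence, assuming a linear, unmodified chain of the 20 standard amino acids and one water molecule lost per peptide bond. The names `AA_ATOMS`, `atom_counts` and `atom_fractions` are hypothetical and not taken from the paper.

```python
from collections import Counter

# Molecular formulas of the 20 standard free amino acids as (C, H, N, O, S).
AA_ATOMS = {
    "A": (3, 7, 1, 2, 0),  "R": (6, 14, 4, 2, 0), "N": (4, 8, 2, 3, 0),
    "D": (4, 7, 1, 4, 0),  "C": (3, 7, 1, 2, 1),  "Q": (5, 10, 2, 3, 0),
    "E": (5, 9, 1, 4, 0),  "G": (2, 5, 1, 2, 0),  "H": (6, 9, 3, 2, 0),
    "I": (6, 13, 1, 2, 0), "L": (6, 13, 1, 2, 0), "K": (6, 14, 2, 2, 0),
    "M": (5, 11, 1, 2, 1), "F": (9, 11, 1, 2, 0), "P": (5, 9, 1, 2, 0),
    "S": (3, 7, 1, 3, 0),  "T": (4, 9, 1, 3, 0),  "W": (11, 12, 2, 2, 0),
    "Y": (9, 11, 1, 3, 0), "V": (5, 11, 1, 2, 0),
}

def atom_counts(sequence):
    """Count C, H, N, O, S atoms in a linear, unmodified peptide chain."""
    totals = Counter()
    for aa in sequence:
        for element, n in zip("CHNOS", AA_ATOMS[aa]):
            totals[element] += n
    # Assumption: each peptide bond releases one H2O.
    bonds = max(len(sequence) - 1, 0)
    totals["H"] -= 2 * bonds
    totals["O"] -= bonds
    return dict(totals)

def atom_fractions(sequence):
    """Relative share of each element among the counted atoms."""
    counts = atom_counts(sequence)
    total = sum(counts.values())
    return {element: n / total for element, n in counts.items()}
```

Under these assumptions, `atom_counts("GA")` gives the formula of the dipeptide glycylalanine, C5H10N2O3. Aggregating such counts over every protein of a proteome would yield one atom profile per species, though the weighting and normalization used in the actual study may differ.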
Background: Seeding is usually the initial step of high-throughput sequence aligners. Two popular seeding strategies are fixed-size seeding (k-mers, minimizers) and variable-size seeding (MEMs, SMEMs, maximal spanning seeds). The former strategy supports fast seed computation, while the latter one benefits from a high seed entropy. Algorithmic bridges between instances of both seeding strategies are of interest for combining their respective advantages. Results: We introduce an efficient strategy for computing MEMs out of fixed-size seeds (k-mers or minimizers). In contrast to previously proposed extend-purge strategies, our merge-extend strategy prevents the creation and filtering of duplicate MEMs. Further, we describe techniques for extracting SMEMs or maximal spanning seeds out of MEMs. A comprehensive benchmarking shows the applicability, strengths, shortcomings and computational requirements of all discussed seeding techniques. Additionally, we report the effects of seed occurrence filters in the context of these techniques. Aside from our novel algorithmic approaches, we analyze hierarchies within fixed-size and variable-size seeding along with a mapping between instances of both seeding strategies. Conclusion: Benchmarking shows that our proposed merge-extend strategy for MEM computation outperforms previous extend-purge strategies in the context of PacBio reads. The observed superiority grows with increasing read size and read quality. Further, the presented filters for extracting SMEMs or maximal spanning seeds out of MEMs outperform FMD-index based extension techniques. All code used for benchmarking is available via GitHub at https://github.com/ITBE-Lab/seed-evaluation.
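To make the idea of deriving MEMs from fixed-size seeds more concrete, here is a minimal sketch, not the authors' implementation. It assumes plain strings, a forward strand only, and a naive hash-based k-mer index. Seeds are grouped by diagonal (reference position minus query position). Each seed is extended to a maximal exact match, and later seeds that lie inside an already extended match are absorbed instead of being extended again, so no duplicate MEM is ever created. The function names `kmer_seeds` and `merge_extend` are hypothetical.

```python
from collections import defaultdict

def kmer_seeds(query, ref, k):
    """All exact k-mer matches as (query_pos, ref_pos) pairs."""
    index = defaultdict(list)
    for r in range(len(ref) - k + 1):
        index[ref[r:r + k]].append(r)
    return [(q, r)
            for q in range(len(query) - k + 1)
            for r in index.get(query[q:q + k], [])]

def merge_extend(query, ref, k, seeds):
    """Turn k-mer seeds into maximal exact matches without duplicates.

    Returns a sorted list of (query_start, ref_start, length) tuples.
    """
    by_diag = defaultdict(list)
    for q, r in seeds:
        by_diag[r - q].append(q)

    mems = []
    for diag, q_positions in by_diag.items():
        q_positions.sort()
        covered_until = -1  # query end of the last match on this diagonal
        for q in q_positions:
            if q + k <= covered_until:
                continue  # seed lies inside the previous match: absorb it
            start, end = q, q + k
            # Extend to the left while characters agree.
            while (start > 0 and start + diag > 0
                   and query[start - 1] == ref[start - 1 + diag]):
                start -= 1
            # Extend to the right while characters agree.
            while (end < len(query) and end + diag < len(ref)
                   and query[end] == ref[end + diag]):
                end += 1
            mems.append((start, start + diag, end - start))
            covered_until = end
    return sorted(mems)
```

For example, with `query = "GATTACA"`, `ref = "CCGATTACATT"` and `k = 3`, the five k-mer hits on diagonal 2 collapse into one match of length 7, while the second occurrence of `ATT` in the reference yields a separate match of length 3. Because the loop only requires that every match contains at least one seed, the same code also accepts a subsampled seed set such as minimizers.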
The cover picture shows how proteome‐based atom (C, H, N, O, S) distributions across all species of the phylogenetic tree revealed U5MTase of Plasmodium falciparum as a distinguished possible therapeutic target, which in turn was used for in silico structure‐based drug design strategies (i.e., 3D protein structure modeling, virtual chemical library screening, and molecular docking) to identify imanixil as a potential inhibitory drug molecule that might serve as a possible remedy for the treatment of malaria. More details can be found in the Full Paper by Subrata Pramanik, Manisha Thaker, Ananda Gopu Perumal, Rajasekaran Ekambaram, Naresh Poondla, Markus Schmidt, Pok‐Son Kim, Arne Kutzner, and Klaus Heese (DOI: 10.1002/minf.201900135).
Structural variant (SV) calling is one of the standard tools of modern bioinformatics for identifying and describing alterations in genomes. This work first presents several complex genomic rearrangements that reveal conceptual ambiguities inherent to their representation via basic SVs. We contextualize these ambiguities theoretically as well as practically and propose a graph-based approach for resolving them. For various yeast genomes, we compute adjacency matrices of our graph model and demonstrate that they provide highly accurate descriptions of one genome in terms of another. An open-source prototype implementation of our approach is available under the MIT license at https://github.com/ITBE-Lab/MA.