We note that forward and reverse-complement representations can be merged by any element-wise operation on the 1D output tensors. Similarly to Shrikumar et al. (2017), we place the dense layers (and the output layer) after representation merging. Apart from summation, we consider two alternative merging functions (note that averaging and adding the representation vectors are essentially equivalent). The max function implements the Gödel t-conorm, corresponding to the OR operation in Gödel fuzzy logic. Even though the activations are not restricted to the interval [0,1], high output values can be interpreted as finding a motif on either of the two strands. The Hadamard product is the product t-norm, corresponding to the AND operation in product fuzzy logic. Here, high values may be understood as finding a motif on both strands at the same time.

2 Training and tuning

2.1 Class weighting

While the data preprocessing procedure results in a balanced training dataset, the mean coverage of pathogen and non-pathogen genomes is drastically different. To explore an alternative way of solving the class imbalance problem, we also simulated an imbalanced training set, where the total of 20 million training reads was simulated with equal mean coverage from all the training genomes, regardless of their labels. This dataset was then used to train ten networks using a class-weighted loss function. PaPrBaG, constituting the state of the art in machine-learning-based pathogenicity prediction, does not support error weighting. For BLAST, a method based on sequence homology, this distinction does not apply at all, as its reference database is constructed over whole genomes. However, this does not influence our final results; the class-weighted networks
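The three merging functions described at the start of this excerpt (summation, max, and Hadamard product) can be illustrated with a minimal NumPy sketch. The function name `merge_representations`, the `how` argument, and the toy vectors are assumptions made for illustration; they are not taken from the authors' code.

```python
import numpy as np


def merge_representations(fwd, rc, how="sum"):
    """Merge forward and reverse-complement representation vectors element-wise.

    Hypothetical helper for illustration only:
      "sum"     -> element-wise addition (equivalent to averaging up to a factor of 2)
      "max"     -> element-wise maximum (Goedel t-conorm, fuzzy OR)
      "product" -> Hadamard product (product t-norm, fuzzy AND)
    """
    fwd = np.asarray(fwd, dtype=float)
    rc = np.asarray(rc, dtype=float)
    if fwd.shape != rc.shape:
        raise ValueError("both representations must have the same shape")
    if how == "sum":
        return fwd + rc
    if how == "max":
        return np.maximum(fwd, rc)
    if how == "product":
        return fwd * rc
    raise ValueError(f"unknown merging function: {how}")


# Toy 1D representations of the two strands (made-up values).
fwd = np.array([0.9, 0.1, 0.0])
rc = np.array([0.2, 0.8, 0.0])

merged_sum = merge_representations(fwd, rc, "sum")
merged_max = merge_representations(fwd, rc, "max")
merged_product = merge_representations(fwd, rc, "product")
```

In this toy case the max stays high wherever either strand responds strongly, whereas the product is large only where both strands respond, which mirrors the OR and AND readings given in the text.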
Denkert (2018) Integrated analysis of the immunological and genetic status in and across cancer types: impact of mutational signatures beyond tumor mutational burden, OncoImmunology, 7:12, e1526613.
ABSTRACT
Harnessing the immune system by checkpoint blockade has greatly expanded the therapeutic options for advanced cancer. Since the efficacy of immunotherapies is influenced by the molecular make-up of the tumor and its crosstalk with the immune system, comprehensive analysis of genetic and immunologic tumor characteristics is essential to gain insight into mechanisms of therapy response and resistance. We investigated the association of immune cell contexture and tumor genetics including tumor mutational burden (TMB), copy number alteration (CNA) load, mutant allele heterogeneity (MATH) and specific mutational signatures (MutSigs) using TCGA data of 5722 tumor samples from 21 cancer types. Among all genetic variables, MutSigs associated with DNA repair deficiency and AID/APOBEC gene activity showed the strongest positive correlations with immune parameters. For smoking-related and UV-light-exposure-associated MutSigs, a few positive correlations were identified, while MutSig 1 (clock-like process) correlated non-significantly or negatively with the major immune parameters in most cancer types. High TMB was associated with high immune cell infiltrates in some but not all cancer types. In contrast, high CNA load and high MATH were mostly associated with low immune cell infiltrates. While a bi- or multimodal distribution of TMB was observed in colorectal, stomach and endometrial cancer, where its levels were associated with POLE/POLD1 mutations and MSI status, TMB was unimodally distributed in most other cancer types, including NSCLC and melanoma.
In summary, this study uncovered specific genetic-immunology associations in major cancer types and suggests that mutational signatures should be further investigated as interesting candidates for response prediction beyond TMB.
Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can also be applied to models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example, the SARS-CoV-2 coronavirus, unknown before it caused the COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages, not only enabling analysis of NGS datasets without requiring any deep learning skills, but also allowing advanced users to easily train and explain new models for genomics.
Motivation: We expect novel pathogens to arise due to their fast-paced evolution, and new species to be discovered thanks to advances in DNA sequencing and metagenomics. Moreover, recent developments in synthetic biology raise concerns that some strains of bacteria could be modified for malicious purposes. Traditional approaches to open-view pathogen detection depend on databases of known organisms, which limits their performance on unknown, unrecognized, and unmapped sequences. In contrast, machine learning methods can infer pathogenic phenotypes from single NGS reads, even though the biological context is unavailable.
Results: We present DeePaC, a Deep Learning Approach to Pathogenicity Classification. It includes a flexible framework allowing easy evaluation of neural architectures with reverse-complement parameter sharing. We show that convolutional neural networks and LSTMs outperform the state-of-the-art based on both sequence homology and machine learning. Combining a deep learning approach with integrating the predictions for both mates in a read pair results in cutting the error rate almost in half in comparison to the previous state-of-the-art.
Availability: The code and the models are available at: https://gitlab.com
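The abstract above does not state how the predictions for the two mates of a read pair are integrated. As a rough sketch only, the snippet below assumes the simplest option, averaging the two per-read scores before thresholding. The function name `integrate_pair_predictions`, the 0.5 threshold, and the example scores are hypothetical.

```python
import numpy as np


def integrate_pair_predictions(p_mate1, p_mate2, threshold=0.5):
    """Combine per-read pathogenicity scores of both mates in each read pair.

    Assumption for illustration: the pair score is the arithmetic mean of the
    two mate scores, and a pair is called pathogenic if that mean exceeds
    `threshold`. Inputs are arrays of scores in [0, 1], one entry per pair.
    """
    p1 = np.asarray(p_mate1, dtype=float)
    p2 = np.asarray(p_mate2, dtype=float)
    if p1.shape != p2.shape:
        raise ValueError("each read pair needs exactly one score per mate")
    pair_score = (p1 + p2) / 2.0
    return pair_score, pair_score > threshold


# Made-up scores for three read pairs.
scores_mate1 = np.array([0.9, 0.4, 0.2])
scores_mate2 = np.array([0.7, 0.7, 0.1])
pair_scores, pair_labels = integrate_pair_predictions(scores_mate1, scores_mate2)
```

In the second pair one mate alone would fall below the threshold, but the averaged score does not, which shows how a pair-level decision can differ from a single-read decision.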
The analysis of neuronal processes distributed across multiple cortical areas aims at the identification of interactions between signals recorded at different sites. Such interactions can be described by measuring the stability of phase angles in the case of oscillatory signals or other forms of signal dependencies for less regular signals. However, before any form of interaction can be analyzed at a given time and frequency, it is necessary to assess whether all potentially contributing signals are present. We have developed a new statistical procedure for the detection of coincident power in multiple simultaneously recorded analog signals, allowing the classification of events as 'non-accidental co-activation'. This method can effectively operate on single trials, each lasting only a few seconds. Signals need to be transformed into time-frequency space, e.g. by applying a short-time Fourier transformation using a Gaussian window. The discrete wavelet transform (DWT) is used to weight the resulting power patterns according to their frequency. Subsequently, the weighted power patterns are binarized by applying a threshold. At this final stage, significant power coincidence is determined across all subgroups of channel combinations for individual frequencies by selecting the maximum ratio between observed and expected duration of co-activation as the test statistic. The null hypothesis that the activity in each channel is independent of the activity in every other channel is simulated by independent, random rotation of the respective activity patterns. We applied this procedure to single trials of multiple simultaneously sampled local field potentials (LFPs) obtained from occipital, parietal, central and precentral areas of three macaque monkeys. Since their task was to use visual cues to perform a precise arm movement, co-activation of numerous cortical sites was expected.
In a data set with 17 channels analyzed, up to 13 sites expressed simultaneous power in the range between 5 and 240 Hz. On average, more than 50% of active channels participated at least once in a significant power co-activation pattern (PCP). Because the significance of such PCPs can be evaluated at the level of single trials, we are confident that this procedure is useful to study single trial variability with sufficient accuracy that much of the behavioral variability can be explained by the dynamics of the underlying distributed neuronal processes.
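The final stage of the procedure described above (maximum observed-to-expected co-activation ratio, tested against independently rotated activity patterns) can be sketched as follows for a single frequency. This is a simplified illustration under stated assumptions, not the authors' implementation: it starts from already binarized patterns and omits the time-frequency transform and DWT weighting, it takes the expected duration under independence to be the number of time bins times the product of per-channel active fractions, and it uses a standard Monte Carlo p-value. All function names are hypothetical.

```python
import itertools

import numpy as np


def coactivation_ratio(binary, channels):
    """Observed / expected co-activation duration for one channel subgroup."""
    sub = binary[list(channels)]
    observed = np.all(sub, axis=0).sum()
    # Assumed expectation under independence: T * product of active fractions.
    expected = binary.shape[1] * np.prod(sub.mean(axis=1))
    return observed / expected if expected > 0 else 0.0


def max_ratio(binary, min_size=2):
    """Test statistic: maximum ratio over all subgroups of at least min_size channels.

    Enumerates every subgroup, so the cost grows exponentially with channel count.
    """
    n_channels = binary.shape[0]
    return max(
        coactivation_ratio(binary, combo)
        for k in range(min_size, n_channels + 1)
        for combo in itertools.combinations(range(n_channels), k)
    )


def rotation_test(binary, n_surrogates=200, seed=0):
    """Compare the observed statistic with surrogates made by random rotation.

    Each channel is circularly shifted by its own random offset, which keeps
    its temporal structure but destroys alignment between channels.
    """
    rng = np.random.default_rng(seed)
    binary = np.asarray(binary, dtype=bool)
    n_channels, n_time = binary.shape
    observed = max_ratio(binary)
    exceed = 0
    for _ in range(n_surrogates):
        shifts = rng.integers(0, n_time, size=n_channels)
        surrogate = np.stack([np.roll(row, s) for row, s in zip(binary, shifts)])
        if max_ratio(surrogate) >= observed:
            exceed += 1
    # Monte Carlo p-value with the usual +1 correction (an assumption here).
    return observed, (exceed + 1) / (n_surrogates + 1)


# Toy example: three channels that are active in exactly the same 10 of 100 bins.
patterns = np.zeros((3, 100), dtype=bool)
patterns[:, 10:20] = True
statistic, p_value = rotation_test(patterns)
```

In the toy example, perfectly aligned bursts give a large ratio for the three-channel subgroup, because the expected overlap of three sparse, independent channels is far shorter than the observed one.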