Background: In recent years, research in artificial neural networks has resurged under the deep-learning (DL) umbrella and grown extremely popular. The recently reported success of DL techniques in crowd-sourced QSAR and predictive toxicology competitions has showcased these methods as powerful tools in drug-discovery and toxicology research. The aim of this work was twofold. First, a large number of hyper-parameter configurations were explored to investigate how they affect the performance of deep neural networks (DNNs) and to provide starting points for tuning DNNs. Second, DNN performance was compared to that of popular methods widely employed in cheminformatics, namely Naïve Bayes (NB), k-nearest neighbor (kNN), random forest (RF) and support vector machines (SVM). Moreover, the robustness of the machine learning methods to different levels of artificially introduced noise was assessed. The open-source Caffe deep-learning framework and modern NVIDIA GPUs were used to carry out this study, allowing a large number of DNN configurations to be explored. Results: We show that feed-forward deep neural networks are capable of achieving strong classification performance and, when optimized, outperform shallow methods across diverse activity classes. The hyper-parameters found to play a critical role are the activation function, dropout regularization, the number of hidden layers and the number of neurons. Tuned DNNs were found to outperform the other methods with statistical significance (p value < 0.01, Wilcoxon test). On average, DNNs achieved a Matthews correlation coefficient (MCC) 0.149 higher than NB, 0.092 higher than kNN, 0.052 higher than SVM with a linear kernel, 0.021 higher than RF and 0.009 higher than SVM with a radial basis function kernel. When exploring robustness to noise, non-linear methods were found to perform well at low noise levels (20% or lower). At higher noise levels (above 30%), however, the Naïve Bayes method performed well and, at the highest noise level of 50%, even outperformed more sophisticated methods across several datasets. Electronic supplementary material: The online version of this article (doi:10.1186/s13321-017-0226-y) contains supplementary material, which is available to authorized users.
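The comparison above is reported in MCC units. As a reminder of what that metric measures, here is a minimal sketch of the Matthews correlation coefficient computed from binary confusion-matrix counts. The function name `mcc` and the choice to return 0.0 when the denominator vanishes are assumptions made for illustration, not details taken from the study.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0 when any marginal count is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Because MCC uses all four cells of the confusion matrix, it remains informative on imbalanced activity classes, where plain accuracy can look good for a trivial classifier.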
We introduce a novel method of indexing graph databases to facilitate subgraph isomorphism and similarity queries. The index comprises two major data structures. The primary structure is a directed acyclic graph that contains a node for each of the unique induced subgraphs of the database graphs. The secondary structure is a hash table that cross-indexes each subgraph for fast isomorphic lookup. To create a hash key that is independent of isomorphism, we use a code-based canonical representation of adjacency matrices, which we have further refined to improve computation speed. We validate the concept by demonstrating its effectiveness in answering queries on two practical datasets. Our experiments show that, for subgraph isomorphism queries, our method outperforms existing methods by more than an order of magnitude.
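The idea of a hash key that is independent of isomorphism can be illustrated with a small sketch. The abstract does not specify the refined canonical code, so the version below substitutes a brute-force one: it takes the lexicographically smallest upper-triangle adjacency bit string over all vertex orderings. The names `canonical_code`, `add_graph`, and `index` are hypothetical, and the factorial running time makes this suitable only for very small graphs.

```python
from itertools import permutations


def canonical_code(n, edges):
    """Smallest upper-triangle adjacency bit string over all vertex orderings.

    Brute force (n! orderings); an illustrative stand-in, not the refined
    code described in the abstract.
    """
    edge_set = {frozenset(e) for e in edges}
    best = None
    for perm in permutations(range(n)):
        code = "".join(
            "1" if frozenset((perm[i], perm[j])) in edge_set else "0"
            for i in range(n)
            for j in range(i + 1, n)
        )
        if best is None or code < best:
            best = code
    return best


# Hash table keyed by (vertex count, canonical code): isomorphic graphs
# always land in the same bucket.
index = {}


def add_graph(name, n, edges):
    index.setdefault((n, canonical_code(n, edges)), []).append(name)


add_graph("path_a", 3, [(0, 1), (1, 2)])
add_graph("path_b", 3, [(0, 1), (0, 2)])  # same path, relabeled
add_graph("triangle", 3, [(0, 1), (1, 2), (0, 2)])
```

Here `path_a` and `path_b` share one bucket while `triangle` gets its own, so an isomorphic lookup reduces to a single dictionary access once the code is computed.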
The 2019 novel coronavirus (2019-nCoV) is widespread in China and other countries. Both 2019-nCoV and severe acute respiratory syndrome coronavirus (SARS-CoV) target angiotensin-converting enzyme 2 (ACE2)-positive cells. ACE2 is present in the salivary gland duct epithelium, which could therefore be a target of 2019-nCoV and SARS-CoV. Animal model experiments with SARS-CoV show that it can infect the epithelial cells of the salivary gland duct in Chinese rhesus macaques by targeting ACE2. Clinical studies have confirmed that 2019-nCoV and SARS-CoV can be detected in the saliva of human patients. We hypothesize that infection with 2019-nCoV or SARS-CoV leads to inflammatory pathological lesions in patients' target organs, and possibly to inflammatory lesions in the salivary glands. 2019-nCoV may cause acute sialoadenitis in the acute phase of infection. After the acute phase, chronic sialoadenitis may result from fibrotic repair. Although there is no direct evidence to prove this, the available indirect evidence suggests that our hypothesis is highly probable.
Finding recurring residue packing patterns, or spatial motifs, that characterize protein structural families is an important problem in bioinformatics. To this end, we apply a novel frequent subgraph mining algorithm to three graph representations of protein three-dimensional (3D) structure. In each protein graph, a vertex represents an amino acid. Vertex-residues are connected by edges using three approaches: first, a simple distance threshold between contact residues; second, the Delaunay tessellation from computational geometry; and third, the recently developed almost-Delaunay tessellation approach. Applying this approach to a set of graphs representing a protein family from the Structural Classification of Proteins (SCOP) database, we typically identify several hundred common subgraphs equivalent to common packing motifs found in the majority of proteins in the family. We also use the counts of motifs extracted from proteins in two different SCOP families as input variables in a binary classification experiment using Support Vector Machines. The resulting models are capable of predicting protein family association with accuracy exceeding 90 percent. Our results indicate that graphs based on both almost-Delaunay and Delaunay tessellations are sparser than contact distance graphs, yet the former afford classification accuracy similar to that of the latter. The protein graph mining and classification approaches developed in this paper can be used for rapid and automated annotation of protein structures determined in structural genomics projects.
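The first of the three edge definitions, the distance threshold, is simple enough to sketch directly. The example below assumes each residue is represented by a single 3D point (for instance a C-alpha position) and that the cutoff is supplied by the caller. Neither the representative atom nor a cutoff value is stated in the abstract, and the function name `contact_graph` is hypothetical.

```python
import math
from itertools import combinations


def contact_graph(coords, threshold):
    """Return edges (i, j), i < j, for residue pairs within `threshold`.

    coords: one (x, y, z) point per residue (assumed representation).
    threshold: distance cutoff in the same units as the coordinates.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(coords), 2)
        if math.dist(a, b) <= threshold
    ]


# Three residues on a line: only the first two are close enough to touch.
edges = contact_graph([(0, 0, 0), (3, 0, 0), (10, 0, 0)], threshold=5.0)
```

A threshold graph keeps every pair inside the cutoff, which is why it tends to be denser than the tessellation-based graphs that the abstract describes as sparser.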
We find recurring amino-acid residue packing patterns, or spatial motifs, that are characteristic of protein structural families, by applying a novel frequent subgraph mining algorithm to graph representations of protein three-dimensional structure. Graph nodes represent amino acids, and edges are chosen in one of three ways: first, using a threshold for contact distance between residues; second, using Delaunay tessellation; and third, using the recently developed almost-Delaunay edges. For a set of graphs representing a protein family from the Structural Classification of Proteins (SCOP) database, subgraph mining typically identifies several hundred common subgraphs corresponding to spatial motifs that are frequently found in proteins in the family but rarely found outside of it. We find that some of the large motifs map onto known functional regions in the two protein families explored in this study, serine proteases and kinases. We also find that graphs based on almost-Delaunay edges significantly reduce the number of edges in the graph representation and hence offer a computational advantage, yet the patterns extracted from such graphs have a biological interpretation approximately equivalent to that of patterns extracted from distance-based graphs.
Recently there has been increasing interest in the problem of transfer learning, in which the typical assumption that training and testing data are drawn from identical distributions is relaxed. We specifically address the problem of transductive transfer learning, in which we have access to labeled training data and unlabeled testing data potentially drawn from different yet related distributions, and the goal is to leverage the labeled training data to learn a classifier that correctly predicts data from the testing distribution. We have derived efficient algorithms for transductive transfer learning based on a novel viewpoint and on the Support Vector Machine (SVM) paradigm of a large-margin hyperplane classifier in a feature space. We show that our method can outperform some recent state-of-the-art approaches to transfer learning on several data sets, with the added benefits of model and data separation and the potential to leverage existing work on support vector machines.
Boosting is a very successful classification algorithm that produces a linear combination of "weak" classifiers (a.k.a. base learners) to obtain high-quality classification models. In this paper we propose a new boosting algorithm in which base learners have structural relationships in the functional space. Though such relationships are generic, our work is particularly motivated by the emerging topic of pattern-based classification for semi-structured data, including graphs. To incorporate the structure information efficiently, we have designed a general model in which an undirected graph captures the relationships among subgraph-based base learners. In our method, we combine an L1 norm penalty and a Laplacian-based L2 norm penalty with the logit loss function of LogitBoost. In this approach, we enforce model sparsity and smoothness in the functional space spanned by the basis functions. We have derived efficient optimization algorithms based on coordinate descent for the new boosting formulation and prove theoretically that it exhibits a natural grouping effect for nearby spatial or overlapping features. Through a comprehensive experimental study, we have demonstrated the effectiveness of the proposed learning methods.
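The combination of penalties described above can be written out as a single objective. The form below is one plausible reading of the abstract, not the paper's own statement. The notation is assumed: $h_j$ are the base learners, $\beta_j$ their weights, $y_i \in \{-1, +1\}$ the labels, $a_{jk} = a_{kj} \ge 0$ the edge weights of the undirected graph over base learners with edge set $E$, $L = D - A$ its graph Laplacian, and $\lambda_1, \lambda_2 \ge 0$ the tuning parameters.

```latex
\min_{\beta}\;
  \sum_{i=1}^{n} \log\!\Bigl(1 + \exp\bigl(-y_i \textstyle\sum_{j} \beta_j h_j(x_i)\bigr)\Bigr)
  \;+\; \lambda_1 \lVert \beta \rVert_1
  \;+\; \lambda_2\, \beta^{\top} L \beta,
\qquad
\beta^{\top} L \beta \;=\; \sum_{(j,k) \in E} a_{jk}\,(\beta_j - \beta_k)^2 .
```

Under this reading, the L1 term drives many weights to exactly zero (sparsity), while the Laplacian term penalizes differences between weights of base learners joined by an edge (smoothness), which is consistent with the grouping effect the abstract mentions.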