One of the most powerful techniques for studying proteins is to look for recurrent fragments (also called substructures), then use them as patterns to characterize the proteins under study. Although protein sequences have been extensively studied in the literature, studying protein three-dimensional (3D) structures can reveal relevant structural and functional information that may not be derived from protein sequences alone. An emergent trend consists of parsing protein 3D structures into graphs of amino acids. Hence, the search for recurrent substructures is formulated as a process of frequent subgraph discovery where each subgraph represents a 3D motif. In this scope, several efficient approaches for frequent 3D motif discovery have been proposed in the literature. However, the set of discovered 3D motifs is too large to be efficiently analyzed and explored in any further process. In this article, we propose a novel pattern selection approach that shrinks the large number of frequent 3D motifs by selecting a subset of representative ones. Existing pattern selection approaches do not exploit domain knowledge. In contrast, our approach incorporates the evolutionary information of amino acids defined in substitution matrices in order to select the representative 3D motifs. We show the effectiveness of our approach on a number of real datasets. Our experimental results show that considering the substitution between amino acids allows our approach to detect many similarities between patterns that are ignored by current subgraph selection approaches, and that it considerably decreases the number of 3D motifs while enhancing their interestingness.
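The idea of comparing motifs through amino acid substitution can be illustrated with a minimal Python sketch. It is not the authors' method: the scores in `TOY_SCORES` are illustrative values in the style of a substitution matrix, the normalization in `label_similarity` is an assumption, and the two motifs are assumed to share the same topology with their nodes already listed in matched order.

```python
# Toy symmetric substitution scores (illustrative values only, hypothetical).
TOY_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6, ("E", "E"): 5,
    ("L", "I"): 2, ("D", "E"): 2,
    ("L", "D"): -4, ("L", "E"): -3, ("I", "D"): -3, ("I", "E"): -3,
}


def score(a, b):
    """Look up the substitution score of two amino acid labels (order-free)."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a)))


def label_similarity(a, b):
    """Map a substitution score to [0, 1]; negative scores count as 0.

    Assumed normalization: divide by the larger of the two self-scores.
    """
    return max(0.0, score(a, b) / max(score(a, a), score(b, b)))


def motif_similarity(labels_a, labels_b):
    """Average label similarity over nodes matched position by position."""
    if len(labels_a) != len(labels_b):
        raise ValueError("motifs must have the same number of nodes")
    sims = [label_similarity(a, b) for a, b in zip(labels_a, labels_b)]
    return sum(sims) / len(sims)


# Two motifs that differ only by an L -> I substitution remain highly similar,
# whereas an exact label comparison would treat them as different patterns.
print(motif_similarity(["L", "D", "L"], ["I", "D", "L"]))
```

Under this kind of measure, motifs whose labels differ only by likely substitutions can be grouped behind one representative, which is the intuition behind the reduction in the number of motifs described above.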
With the increasing size and complexity of available databases, existing machine learning and data mining algorithms are facing a scalability challenge. In many applications, the number of features describing the data can be extremely high. This hinders further exploration or can even make it infeasible. In fact, many of these features are redundant or simply irrelevant. Hence, feature selection plays a key role in helping to overcome the problem of information overload, especially in big data applications. Since many complex datasets can be modeled by graphs of interconnected labeled elements, in this work we are particularly interested in feature selection for subgraph patterns. In this paper, we propose MR-SimLab, a MapReduce-based approach for subgraph selection from large input subgraph sets. In many applications, it is easy to compute pairwise similarities between the labels of the graph nodes. Our approach leverages this information to measure an approximate subgraph matching by aggregating the elementary label similarities between the matched nodes. Based on the aggregated similarity scores, our approach selects a small subset of informative representative subgraphs. We provide a distributed implementation of our algorithm on top of the MapReduce framework that optimizes the computational efficiency of our approach for big data applications. We experimentally evaluate MR-SimLab on real datasets. The obtained results show that our approach is scalable and that the selected subgraphs are informative.
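The selection step can be sketched in plain Python as a partition-then-merge procedure in the spirit of MapReduce. This is only a sketch under stated assumptions, not the MR-SimLab implementation: the greedy threshold rule, the averaging used to aggregate label similarities, and all names (`pattern_similarity`, `select_representatives`, `map_reduce_select`, `toy_label_sim`) are hypothetical, and each subgraph is reduced to a tuple of node labels listed in an already matched order.

```python
from typing import Callable, List, Sequence, Tuple

Pattern = Tuple[str, ...]  # node labels of a subgraph, in a fixed matched order
LabelSim = Callable[[str, str], float]


def pattern_similarity(p: Pattern, q: Pattern, label_sim: LabelSim) -> float:
    """Aggregate (here: average) the label similarities of matched nodes."""
    if len(p) != len(q) or not p:
        return 0.0
    return sum(label_sim(a, b) for a, b in zip(p, q)) / len(p)


def select_representatives(
    patterns: Sequence[Pattern], label_sim: LabelSim, threshold: float
) -> List[Pattern]:
    """Greedy selection: keep a pattern only if no kept pattern is similar to it."""
    reps: List[Pattern] = []
    for p in patterns:
        if all(pattern_similarity(p, r, label_sim) < threshold for r in reps):
            reps.append(p)
    return reps


def map_reduce_select(
    patterns: Sequence[Pattern],
    label_sim: LabelSim,
    threshold: float,
    n_partitions: int,
) -> List[Pattern]:
    """Select per partition (map phase), then select again on the union (reduce phase)."""
    partitions = [patterns[i::n_partitions] for i in range(n_partitions)]
    local = [select_representatives(part, label_sim, threshold) for part in partitions]
    merged = [p for reps in local for p in reps]
    return select_representatives(merged, label_sim, threshold)


def toy_label_sim(a: str, b: str) -> float:
    """Hypothetical label similarity: labels A and B are treated as half-similar."""
    if a == b:
        return 1.0
    return 0.5 if {a, b} == {"A", "B"} else 0.0


patterns = [("A", "C"), ("B", "C"), ("C", "C"), ("A", "A")]
print(map_reduce_select(patterns, toy_label_sim, threshold=0.75, n_partitions=2))
```

In this toy run, `("B", "C")` survives its own partition but is dropped in the merge step because it is close enough to `("A", "C")`. The per-partition calls are independent of each other, which is what would let them run in parallel in a real distributed setting.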