Given a set of molecular structure data preclassified into a number of classes, the molecular classification problem is concerned with discovering interesting structural patterns in the data so that "unseen" molecules not originally in the dataset can be accurately classified. To tackle the problem, interesting molecular substructures have to be discovered. This is typically done by first representing molecular structures as molecular graphs and then using graph-mining algorithms to discover frequently occurring subgraphs in them. These subgraphs are then used to characterize different classes for molecular classification. While such an approach can be very effective, a substructure that occurs frequently in one class may also occur in another. Discovering frequent subgraphs may, therefore, not always be the most effective basis for molecular classification. In this paper, we propose a novel technique called mining interesting substructures in molecular data for classification (MISMOC) that can discover interesting frequent subgraphs not just for characterizing a molecular class but also for distinguishing it from the others. Using a test statistic, MISMOC screens each frequent subgraph to determine whether it is interesting. For those that are interesting, their degrees of interestingness are determined using an information-theoretic measure. When classifying an unseen molecule, its structure is matched against the interesting subgraphs in each class, and a total interestingness measure for classifying the molecule into a particular class is computed from the interestingness of each matched subgraph. The performance of MISMOC is evaluated using both artificial and real datasets, and the results show that it can be an effective approach for molecular classification.
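The screening and scoring steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the test statistic or the information-theoretic measure, so the sketch assumes an adjusted residual on a pattern-by-class contingency table for screening and a smoothed log-ratio (weight of evidence) for interestingness. Subgraph mining and matching are omitted; each molecule is assumed to be already reduced to a set of hypothetical subgraph identifiers, and all function names and the 1.96 threshold are illustrative choices.

```python
import math
from collections import defaultdict


def adjusted_residual(n_pc, n_p, n_c, n):
    """Adjusted residual of the (pattern present, class) cell (assumed test statistic)."""
    expected = n_p * n_c / n
    variance = expected * (1 - n_p / n) * (1 - n_c / n)
    if variance <= 0:
        # Pattern occurs in every molecule (or the class is everything): no signal.
        return 0.0
    return (n_pc - expected) / math.sqrt(variance)


def weight_of_evidence(n_pc, n_p, n_c, n, alpha=0.5):
    """Smoothed log-ratio of pattern frequency inside vs. outside the class (assumed measure)."""
    p_in = (n_pc + alpha) / (n_c + 2 * alpha)
    p_out = (n_p - n_pc + alpha) / (n - n_c + 2 * alpha)
    return math.log(p_in / p_out)


def train(molecules, labels, threshold=1.96):
    """Keep only (pattern, class) pairs that pass the screen; store their weights."""
    n = len(molecules)
    class_count = defaultdict(int)
    pattern_count = defaultdict(int)
    joint = defaultdict(int)
    for patterns, label in zip(molecules, labels):
        class_count[label] += 1
        for p in patterns:
            pattern_count[p] += 1
            joint[(p, label)] += 1

    weights = {}
    for p, n_p in pattern_count.items():
        for c, n_c in class_count.items():
            n_pc = joint.get((p, c), 0)
            if abs(adjusted_residual(n_pc, n_p, n_c, n)) > threshold:
                weights[(p, c)] = weight_of_evidence(n_pc, n_p, n_c, n)
    return weights, sorted(class_count)


def classify(patterns, weights, classes):
    """Sum the weights of matched interesting patterns per class; pick the largest total."""
    scores = {c: sum(weights.get((p, c), 0.0) for p in patterns) for c in classes}
    return max(scores, key=scores.get), scores
```

In this sketch, a pattern that appears in every molecule of every class is frequent but is screened out, because its presence says nothing about class membership. That is the distinction between frequent and interesting subgraphs that the abstract draws.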
Many methods exist for classifying genomic data by aligning, comparing, and analyzing primary nucleotide sequences using algorithms such as dynamic programming and kinetic folding. These methods are, however, not always effective, as motifs are more conserved in structures than in sequences. Instead of performing classification based on primary sequences, we therefore propose to perform the task based on structure, exploiting the way molecules form from a sequence of nucleotides: a primary sequence folds back onto itself to form a secondary structure and then a tertiary structure. The algorithm we propose performs data mining on structural data and is called the Random Multi-Level Attributed (RMLA) graph algorithm; it mines and represents secondary genomic structure from biomolecules such as tRNA. Structural similarity is identified using an information measure to characterize the resultant class. Experiments are based on known tRNA structural data from a database compiling tRNA genes. The results show that our approach can effectively classify different classes of tRNA secondary structure. We also compare our results with those of other classification algorithms to demonstrate its effectiveness. Our approach offers a better way to classify structural data. In fact, the RMLA graph is suitable not only for the classification of genomic data; wherever graphs are used to model data, it is useful for discovering patterns in the databases.