Motivation: While profile hidden Markov models (HMMs) are successful and powerful methods to recognize homologous proteins, they can break down when homology becomes too distant due to lack of sufficient training data. We show that we can improve the performance of HMMs in this domain by using a simple simulated model of evolution to create an augmented training set.Results: We show, in two different remote protein homolog tasks, that HMMs whose training is augmented with simulated evolution outperform HMMs trained only on real data. We find that a mutation rate between 15 and 20% performs best for recognizing G-protein coupled receptor proteins in different classes, and for recognizing SCOP super-family proteins from different families.Contacts: anoop.kumar@tufts.edu;lenore.cowen@tufts.edu
In recent years, Twitter has become one of the most important modes for social networking and disseminating content on a variety of topics. It has developed into a popular medium for political discourse and social organization during elections. There has been growing body of literature demonstrating the ability to predict the outcome of elections from Twitter data.This works aims to test the predictive power of Twitter in inferring the winning candidate and vote percentages of the candidates in an election. Our prediction is based on the number of times the name of a candidate is mentioned in tweets prior to elections. We develop new methods to augment the counts by counting not only the presence of candidate's official names but also their aliases and commonly appearing names. In addition, we devised a technique to include relevant and filter irrelevant tweets based on predefined set of keywords. Our approach is successful in predicting the winner of all three presidential elections held in Latin America during the months of
Using the Matt structure alignment program, we take a tour of protein space, producing a hierarchical clustering scheme that divides protein structural domains into clusters based on geometric dissimilarity. While it was known that purely structural, geometric, distance-based measures of structural similarity, such as Dali/FSSP, could largely replicate hand-curated schemes such as SCOP at the family level, it was an open question as to whether any such scheme could approximate SCOP at the more distant superfamily and fold levels. We partially answer this question in the affirmative, by designing a clustering scheme based on Matt that approximately matches SCOP at the superfamily level, and demonstrates qualitative differences in performance between Matt and DaliLite. Implications for the debate over the organization of protein fold space are discussed. Based on our clustering of protein space, we introduce the Mattbench benchmark set, a new collection of structural alignments useful for testing sequence aligners on more distantly homologous proteins.
Motivation: One of the most successful methods to date for recognizing protein sequences that are evolutionarily related, has been profile hidden Markov models. However, these models do not capture pairwise statistical preferences of residues that are hydrogen bonded in β-sheets. We thus explore methods for incorporating pairwise dependencies into these models.Results: We consider the remote homology detection problem for β-structural motifs. In particular, we ask if a statistical model trained on members of only one family in a SCOP β-structural superfamily, can recognize members of other families in that superfamily. We show that HMMs trained with our pairwise model of simulated evolution achieve nearly a median 5% improvement in AUC for β-structural motif recognition as compared to ordinary HMMs.Availability: All datasets and HMMs are available at: http://bcb.cs.tufts.edu/pairwise/Contact: anoop.kumar@tufts.edu; lenore.cowen@tufts.edu
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.