Biological sequence comparison is a key step in inferring the relatedness of various organisms and the functional similarity of their components. Thanks to the Next Generation Sequencing efforts, an abundance of sequence data is now available to be processed for a range of bioinformatics applications. Embedding a biological sequence -over a nucleotide or amino acid alphabet -in a lower dimensional vector space makes the data more amenable for use by current machine learning tools, provided the quality of embedding is high and it captures the most meaningful information of the original sequences.Motivated by recent advances in the text document embedding literature, we present a new method, called seq2vec, to represent a complete biological sequence in an Euclidean space. The new representation has the potential to capture the contextual information of the original sequence necessary for sequence comparison tasks. We test our embeddings with protein sequence classification and retrieval tasks and demonstrate encouraging outcomes.
Similarity-based search of sequence collections is a core task in bioinformatics, one dominated for most of the genomic era by exact and heuristic alignment-based algorithms. However, even efficient heuristics such as BLAST may not scale to the data sets now emerging, motivating a range of alignment-free alternatives exploiting the underlying lexical structure of each sequence. In this paper, we introduce two supervised approaches-SuperVec and SuperVecX-to learn sequence embeddings. These methods extend earlier Representation Learning (RepL) based methods to include class-related information for each sequence during training. Including class information ensures that related sequence fragments have proximal representations in the target space, better reflecting the structure of the domain. We show the quality of the embeddings learned through these methods on (i) sequence retrieval and (ii) classification tasks. We also propose an hierarchical tree-based approach specifically designed for the sequence retrieval problem. The resulting methods, which we term H-SuperVec or H-SuperVecX, according to their respective use of SuperVec or SuperVecX, learn embeddings across a range of feature spaces based on exclusive and exhaustive subsets of the class labels. Experiments show that the proposed methods perform better for retrieval and classification tasks over existing (unsupervised) RepL-based approaches. Further, the new methods are an order of magnitude faster than BLAST for the database retrieval task, supporting hybrid approaches that rapidly filter the collection so that only potentially relevant records remain. Such filtering of the original database allows slower but more accurate methods to be executed quickly over a far smaller dataset. Thus, we may achieve faster query processing and higher precision than before. OPEN ACCESSCitation: Kimothi D, Biyani P, Hogan JM, Soni A, Kelly W (2020) Learning supervised embeddings for large scale sequence comparisons. PLoS ONE 15(3): e0216636. https://doi.org/10.Award (QUTPRA) Scholarship for providing introduction in 1990 [2]. Exact algorithms for finding global and local sequence alignments were introduced respectively in [3] and [4]. Here the alignment score constitutes a pseudometric and a principled yet time-consuming method for quantifying sequence similarity and divergence. A high alignment score is taken to indicate a high probability that the sequences are similar. However, this intuition may break down for isolated local alignments or more remote homology, and fail altogether under structural re-arrangement.BLAST rapidly identifies short, high-scoring seed matches between the sequences, extending and combining these as long as a strong match score is maintained. As is well known, the approximate alignment and its significance are estimated using Karlin-Altschul statistics, and these are used to rank the results returned when queries are submitted to sequence collections. BLAST and more recent sequence searching methods such as MMseqs2 [5] uses word-based heuristics...
Similarity-based search of sequence collections is a core task in bioinformatics, one dominated for most of the genomic era by exact and heuristic alignment-based algorithms. However, even efficient heuristics such as BLAST may not scale to the data sets now emerging, motivating a range of alignment-free alternatives exploiting the underlying lexical structure of each sequence.In this paper, we introduce SuperVec, a novel supervised approach to learning sequence embeddings. Our method extends earlier Representation Learning (RL) based methods to include jointly contextual and class-related information for each sequence during training. This ensures that related sequence fragments have proximal representations in the target space, better reflecting the structure of the domain.Such representations may be used for downstream machine learning tasks or employed directly. Here, we apply SuperVec embeddings to a sequence retrieval task, where the goal is to retrieve sequences with the same family label as a given query. The SuperVec approach is extended further through H-SuperVec, a tree-based hierarchical method which learns embeddings across a range of feature spaces based on the class labels and their exclusive and exhaustive subsets.Experiments show that supervised learning of embeddings based on sequence labels using SuperVec and H-SuperVec provides a substantial improvement in retrieval performance over existing (unsupervised) RL-based approaches. Further, the new methods are an order of magnitude faster than BLAST for the database retrieval task, supporting hybrid approaches in which SuperVec rapidly filters the collection so that only potentially relevant records remain, allowing slower, more accurate methods to be executed quickly over a far smaller dataset. Thus, we may achieve faster query processing and higher precision than before.Finally, for some problems, direct use of embeddings is already sufficient to yield high levels of precision and recall. Extending this work to encompass weaker homology is the subject of ongoing research.
Protein-Protein Interactions (PPIs) are a crucial mechanism underpinning the function of the cell. Predicting the likely relationship between a pair of proteins is thus an important problem in bioinformatics, and a wide range of machine-learning based methods have been proposed for this task. Their success is heavily dependent on the construction of the feature vectors, with most using a set of physico-chemical properties derived from the sequence. Few work directly with the sequence itself.Recent works on embedding sequences in a low dimensional vector space has shown the utility of this approach for tasks such as protein classification and sequence search. In this paper, we extend these ideas to the PPI task, making inferences from the pair instead of for the individual sequences. We evaluate the method on human and yeast PPI datasets, benchmarking against the established methods. These results demonstrate that we can obtain sequence encodings for the PPI task which achieve similar levels of performance to existing methods without reliance on complex physico-chemical feature sets.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.