Biological sequence comparison is a key step in inferring the relatedness of various organisms and the functional similarity of their components. Thanks to the Next Generation Sequencing efforts, an abundance of sequence data is now available to be processed for a range of bioinformatics applications. Embedding a biological sequence -over a nucleotide or amino acid alphabet -in a lower dimensional vector space makes the data more amenable for use by current machine learning tools, provided the quality of embedding is high and it captures the most meaningful information of the original sequences.Motivated by recent advances in the text document embedding literature, we present a new method, called seq2vec, to represent a complete biological sequence in an Euclidean space. The new representation has the potential to capture the contextual information of the original sequence necessary for sequence comparison tasks. We test our embeddings with protein sequence classification and retrieval tasks and demonstrate encouraging outcomes.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.