2018
DOI: 10.1101/314260
Preprint

16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses

Abstract: Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure in situ. We explore using word and sentence embedding approaches for nucleotide sequences since they may be a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding ("embedding") each sequence into a dense, low-dimensional, numeric vector space. Here, we use …

Citations: cited by 8 publications (12 citation statements)
References: 48 publications (75 reference statements)
“…Woloszynek et al. represent each 16S sequence by the set of k-length nucleotide sequences (k-mers) it includes, and embed those k-mers to create a vector representation of each sequence (43).…”
Section: Current Methods For Dimensionality Reduction (mentioning)
confidence: 99%
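
The k-mer representation described in this statement can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `kmers` and `kmer_set` and the example read are hypothetical, and overlapping windows with a stride of one are an assumption.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every overlapping k-mer of `sequence`, in order of position."""
    sequence = sequence.upper()
    # A sequence of length n yields n - k + 1 windows (none if n < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_set(sequence: str, k: int) -> set[str]:
    """Return the set of distinct k-mers that `sequence` includes."""
    return set(kmers(sequence, k))


# Hypothetical toy read; a real 16S read would be far longer.
read = "ACGTACGT"
print(kmers(read, 4))             # ['ACGT', 'CGTA', 'GTAC', 'TACG', 'ACGT']
print(sorted(kmer_set(read, 4)))  # ['ACGT', 'CGTA', 'GTAC', 'TACG']
```

Each k-mer then plays the role of a "word" that an embedding model can map to a vector.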
“…In the work of Woloszynek et al (2018), the objective is to add, in addition to taxonomic profiling, a method to retrieve the source environment of a metagenome (phenotype prediction). A Skip-gram word2vec algorithm (Mikolov et al 2013) is trained for k-mer embeddings and a SIF algorithm (Arora et al 2017) is used to create read and sample embeddings.…”
Section: Related Work (mentioning)
confidence: 99%
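
The SIF step mentioned in this statement (a frequency-weighted average of k-mer vectors followed by removal of the first principal component) might look roughly like the sketch below. It is an illustration under stated assumptions, not the cited implementation: `sif_embed` is a hypothetical name, the smoothing constant `a = 1e-3` is an assumed default, and the random `toy_vectors` stand in for k-mer vectors that would normally come from a trained Skip-gram model.

```python
from collections import Counter

import numpy as np


def sif_embed(sequences, kmer_vectors, k, a=1e-3):
    """SIF-style embeddings: weighted k-mer average, minus the common component."""
    tokenized = [[s[i:i + k] for i in range(len(s) - k + 1)] for s in sequences]
    counts = Counter(t for toks in tokenized for t in toks)
    total = sum(counts.values())
    dim = len(next(iter(kmer_vectors.values())))

    emb = np.zeros((len(sequences), dim))
    for row, toks in enumerate(tokenized):
        known = [t for t in toks if t in kmer_vectors]
        if not known:
            continue  # no embeddable k-mers: leave the zero vector
        # Frequent k-mers get small weights: a / (a + p(kmer)).
        weights = np.array([a / (a + counts[t] / total) for t in known])
        vectors = np.array([kmer_vectors[t] for t in known])
        emb[row] = weights @ vectors / len(known)

    # Remove the projection onto the first right singular vector,
    # i.e. the direction shared by all sequence embeddings.
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    pc = vt[0]
    return emb - np.outer(emb @ pc, pc)


# Hypothetical toy data: three short reads and random 8-dimensional 3-mer vectors.
reads = ["ACGTACGTAC", "TTGACCGTAA", "ACGTTTGACC"]
vocab = sorted({r[i:i + 3] for r in reads for i in range(len(r) - 2)})
rng = np.random.default_rng(0)
toy_vectors = {kmer: rng.normal(size=8) for kmer in vocab}

embeddings = sif_embed(reads, toy_vectors, k=3)
print(embeddings.shape)  # (3, 8)
```

The same function could be applied at the sample level by treating all k-mers of a sample as one long "sentence"; whether the cited work does exactly this is not stated in the excerpt.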
“…Moreover, it is difficult to know the location of a read in genomes because the DNA was fragmented prior to sequencing and there is no particular order to the reads after sequencing. To transform the reads into something similar to words, a possible approach may be to simply split the sequences into k-mers (Menegaux et al 2019; Woloszynek et al 2018; Min et al 2017; Q. Liang et al 2020). Various sizes of k can be considered depending on the task.…”
Section: The Representation Of Metagenomic Data With Embeddings (mentioning)
confidence: 99%
“…Similarly (Woloszynek et al, 2019) computed embeddings for all k-mers using a range of k; they also used doc2vec to find the whole protein embedding. (Bepler and Berger, 2019) learned an embedding of each amino acid position incorporating global structural similarity between a pair of proteins and contact map information for each protein.…”
Section: Constant Length Subsequence Vectors (mentioning)
confidence: 99%