2021
DOI: 10.48550/arxiv.2108.08143
Preprint

Effective and scalable clustering of SARS-CoV-2 sequences

Abstract: SARS-CoV-2, like any other virus, continues to mutate as it spreads, according to an evolutionary process. Unlike any other virus, the number of currently available sequences of SARS-CoV-2 in public databases such as GISAID is already several million. This amount of data has the potential to uncover the evolutionary dynamics of a virus like never before. However, a million is already several orders of magnitude beyond what can be processed by the traditional methods designed to reconstruct a virus's evolutiona…

Cited by 2 publications (2 citation statements)
References 29 publications
“…This approach has been highly successful in the analysis of data from various domains such as graphs [20,19], nodes in graphs [7], electricity consumption [5,6] and images [11]. This approach yields significant success in sequence analysis, since the features representation takes into account the sequential nature of the data, such as texts [33,32,31], electroencephalography and electromyography sequences [9,37], Networks [4], and biological sequences [26,17,23,8]. For biological sequences (DNA and protein), a feature vector based on counts of all length k substrings (called k-mers) occurring exactly or inexactly up to m mismatches (mimicking biological mutations) is proposed in [26].…”
Section: Introduction
mentioning
confidence: 99%
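
The k-mer feature vector described in the statement above can be illustrated with a minimal sketch. It assumes a DNA alphabet of A, C, G, T and models "inexact" matches as Hamming distance at most m; the function names (`hamming`, `kmer_spectrum`) are hypothetical and this is not necessarily the exact construction used in [26]. It enumerates all 4^k k-mers by brute force, so it is only practical for small k.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed DNA alphabet; protein sequences would need a larger one


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def kmer_spectrum(seq, k, m=0, alphabet=ALPHABET):
    """Return (kmers, counts) for a sequence.

    kmers lists every possible length-k string over the alphabet.
    counts[i] is the number of length-k windows of seq that lie
    within Hamming distance m of kmers[i] (m=0 means exact matches).
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        # Skip windows with symbols outside the alphabet (e.g. 'N').
        if any(c not in alphabet for c in window):
            continue
        if m == 0:
            counts[index[window]] += 1
        else:
            # Brute-force scan of all k-mers; fine for a small k.
            for kmer in kmers:
                if hamming(window, kmer) <= m:
                    counts[index[kmer]] += 1
    return kmers, counts


if __name__ == "__main__":
    kmers, exact = kmer_spectrum("ACGT", k=2)
    _, fuzzy = kmer_spectrum("ACGT", k=2, m=1)
    print(dict(zip(kmers, exact)))
    print(dict(zip(kmers, fuzzy)))
```

With m = 0 each window contributes to exactly one coordinate; with m > 0 each window contributes to every k-mer in its mismatch neighborhood, so sequences that differ by a few mutations still produce similar vectors.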
“…At this pandemic stage, keeping the spread of new variants under control becomes a key issue. In this context, inspired by a multitude of applications in bioinformatics [16,15,14,18,7], several methods of variants classification have been proposed exploiting Machine Learning (ML) and Deep Learning (DL) techniques [8,6,22]. These methods provide efficient tools for the classification and clustering of SARS-CoV-2 samples.…”
mentioning
confidence: 99%