2022
DOI: 10.1101/2022.01.15.476464
Preprint

MeShClust v3.0: High-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores

Abstract: Background: Tools for accurately clustering biological sequences are among the most important in computational biology. Two pioneering tools for clustering sequences are CD-HIT and UCLUST, both of which are fast and consume reasonable amounts of memory; however, there is considerable room for improvement in terms of cluster quality. Motivated by this opportunity for improving cluster quality, we applied the mean shift algorithm in MeShClust v1.0. The mean shift algorithm is an instance of unsupervised learning.…

Cited by 3 publications (3 citation statements)
References 43 publications
“…In addition, pipelines that use sequential discovery and masking stages avoid the inter-tool clustering problem altogether (RepeatModeler). This is an area that is likely to see improvement in coming years as novel sequence distance estimation [103] and clustering techniques [104] are evaluated in the context of TE families.…”
Section: TE Discovery Pipelines (mentioning)
confidence: 99%
“…To compare efficiency and accuracy of RabbitTClust with these tools, we created a subset of bact-RefSeq called sub-Bact , which contains 10,562 genomes with a total size of 43 GB in FASTA format. We execute MeShClust3 with the commands meshclust -d sub-Bacteria.fna -o sub-Bacteria.clust -t 0.84 -b 1000 -v 4000 (as recommended in [21]) and Gclust using gclust -both -nuc -threads 128 -ext 1 -chunk 2000MB sub-Bacteria.sorted.fna > sub-Bacteria.clust with a larger chunk size for better thread scalability. Using 128 threads, MeShClust3 and Gclust can finish the clustering of sub-Bact with a runtime of 51.60 hours and 25.01 hours, a memory footprint of 139.17 GB and 156.35 GB, and an NMI score of 0.920 and 0.812, respectively.…”
Section: Results (mentioning)
confidence: 99%
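
The statement above compares tools by NMI (normalized mutual information), a score between 0 and 1 that measures how well a predicted clustering agrees with reference labels. The following is a minimal sketch of how such a score can be computed, not the evaluation code used by the citing paper. The function name and the toy labels are hypothetical, and the arithmetic-mean normalization is an assumption, since the quoted text does not say which normalization was used.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_true, labels_pred):
    """NMI between two labelings, normalized by the mean of the entropies.

    Assumption: arithmetic-mean normalization; other variants (min, max,
    geometric mean) exist and give different values.
    """
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("labelings must be non-empty and of equal length")

    # Cluster sizes in each labeling and sizes of their pairwise overlaps.
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    # Mutual information: sum over non-empty overlaps of p(t,p) * log(p(t,p) / (p(t) p(p))).
    mi = 0.0
    for (t, p), n_tp in count_joint.items():
        mi += (n_tp / n) * math.log(n_tp * n / (count_true[t] * count_pred[p]))

    # Shannon entropy of each labeling.
    h_true = -sum((c / n) * math.log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in count_pred.values())

    mean_entropy = (h_true + h_pred) / 2
    if mean_entropy == 0:
        # Both labelings put everything in a single cluster: treat as perfect agreement.
        return 1.0
    return mi / mean_entropy


if __name__ == "__main__":
    # Hypothetical toy data: reference labels vs. cluster IDs from a tool.
    reference = ["A", "A", "A", "B", "B", "C"]
    predicted = [1, 1, 2, 2, 2, 3]
    print(round(normalized_mutual_information(reference, predicted), 3))
```

Two labelings that describe the same partition under different cluster names score 1, and statistically independent labelings score 0, so the score does not depend on how a tool numbers its clusters.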
“…Recent tools for large-scale clustering of biological sequences include Linclust [19], Gclust [20], and MeShClust3 [21]. Linclust measures similarities by gapless local alignment, which suffers from high runtimes and has a significant memory footprint.…”
Section: Introduction (mentioning)
confidence: 99%