MeShClust v3.0: High-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores

Girgis, Hani Z.

doi:10.1101/2022.01.15.476464

Cited by 3 publications

(3 citation statements)

References 43 publications

“…In addition, pipelines that use sequential discovery and masking stages avoid the inter-tool clustering problem altogether (RepeatModeler). This is an area that is likely to see improvement in coming years as novel sequence distance estimation [103] and clustering techniques [104] are evaluated in the context of TE families.…”

Section: TE Discovery Pipelines (mentioning)

confidence: 99%

Methodologies for the De novo Discovery of Transposable Element Families

Storer, Hubley, Rosen et al. 2022

Genes


The discovery and characterization of transposable element (TE) families are crucial tasks in the process of genome annotation. Careful curation of TE libraries for each organism is necessary as each has been exposed to a unique and often complex set of TE families. De novo methods have been developed; however, a fully automated and accurate approach to the development of complete libraries remains elusive. In this review, we cover established methods and recent developments in de novo TE analysis. We also present various methodologies used to assess these tools and discuss opportunities for further advancement of the field.

“…To compare efficiency and accuracy of RabbitTClust with these tools, we created a subset of bact-RefSeq called sub-Bact, which contains 10,562 genomes with a total size of 43 GB in FASTA format. We execute MeShClust3 with the commands meshclust -d sub-Bacteria.fna -o sub-Bacteria.clust -t 0.84 -b 1000 -v 4000 (as recommended in [21]) and Gclust using gclust -both -nuc -threads 128 -ext 1 -chunk 2000MB sub-Bacteria.sorted.fna > sub-Bacteria.clust with a larger chunk size for better thread scalability. Using 128 threads, MeShClust3 and Gclust can finish the clustering of sub-Bact with a runtime of 51.60 hours and 25.01 hours, a memory footprint of 139.17 GB and 156.35 GB, and an NMI score of 0.920 and 0.812, respectively.…”

Section: Results (mentioning)

confidence: 99%
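
The NMI (normalized mutual information) score quoted in the statement above compares a predicted clustering against reference labels. The following is a minimal sketch of one common way to compute it, not the evaluation code used by the citing paper. It assumes the arithmetic-mean normalization, 2·I(T;P) / (H(T) + H(P)); the quoted passage does not say which normalization was used, and the function names here are illustrative only.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label assignments of equal length."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (n_xy / n) * log(n_xy * n / (count_a[x] * count_b[y]))
        for (x, y), n_xy in joint.items()
    )


def nmi(truth, predicted):
    """NMI with arithmetic-mean normalization (an assumption, see lead-in)."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("label lists must be non-empty and of equal length")
    h_t, h_p = entropy(truth), entropy(predicted)
    if h_t == 0 and h_p == 0:
        # Both assignments put everything in one cluster: treat as identical.
        return 1.0
    return 2 * mutual_information(truth, predicted) / (h_t + h_p)


# Cluster names do not matter, only the grouping does.
same_grouping = nmi([0, 0, 1, 1], ["b", "b", "a", "a"])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

A score near 1 means the predicted clusters reproduce the reference grouping, and a score near 0 means the two groupings are statistically independent.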

“…Recent tools for large-scale clustering of biological sequences include Linclust [19], Gclust [20], and MeShClust3 [21]. Linclust measures similarities by gapless local alignment, which suffers from high runtimes and has a significant memory footprint.…”

Section: Introduction (mentioning)

confidence: 99%

RabbitTClust: enabling fast clustering analysis of millions bacteria genomes with MinHash sketches

Yin, Yan et al. 2022

Preprint


We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. 113,674 complete bacterial genome sequences (RefSeq: 455 GB in FASTA format) can be clustered within less than 6 minutes and 1,009,738 GenBank assembled bacterial genomes (4.0 TB in FASTA format) within only 34 minutes on a 128-core workstation. Our results further identify 1,269 repetitive genomes (identical nucleotide content) in RefSeq bacterial genomes.


GradHC: Highly Reliable Gradual Hash-based Clustering for DNA Storage Systems

Shabat, Hadad, Boruchovsky et al. 2023

Preprint


As data storage challenges grow and existing technologies approach their limits, synthetic DNA emerges as a promising storage solution due to its remarkable density and durability advantages. While cost remains a concern, emerging sequencing and synthetic technologies aim to mitigate it, yet introduce challenges such as errors in the storage and retrieval process. One crucial task in a DNA storage system is clustering numerous DNA reads into groups that represent the original input strands. In this paper, we review different methods for evaluating clustering algorithms and introduce a novel clustering algorithm for DNA storage systems, named Gradual Hash-based clustering (GradHC). The primary strength of GradHC lies in its capability to cluster with excellent accuracy various types of designs, including varying strand lengths, cluster sizes (including extremely small clusters), and different error ranges. Benchmark analysis demonstrates that GradHC is significantly more stable and robust than other clustering algorithms previously proposed for DNA storage, while also producing highly reliable clustering results.
