Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters

Chu, Justin; Mohamadi, Hamid; Erhan, Emre; Tse, Jeffery; Chiu, Readman; Yeo, Sarah; Birol, İnanç

doi:10.1073/pnas.1903436117

Cited by 11 publications

(11 citation statements)

References 52 publications

“…Like the counting Bloom filter, the race conditions are minimized for multithreaded insertion. Code adapted from the Multi-index Bloom filter publication (Chu et al, 2020). • Indexlr An optimized and versatile minimizer calculator.…”

Section: Design and Implementation (mentioning)

confidence: 99%

btllib: A C++ library with Python interface for efficient genomic sequence processing

Nikolić

Kazemi

Coombe

et al. 2022

JOSS

Self Cite

Bioinformaticians often do not have software engineering training or background, and software quality is not the top priority of research groups due to limited time and funding (Georgeson et al., 2019). Additionally, one-off scripts or code is frequently written to perform a specific task instead of reusing existing code. This could be because the pre-existing computer programming code is either not well written, not widely available, insufficiently documented, inefficient, or not general enough. This practice leads to lower quality and non-reusable code. As bioinformatics analyses are increasingly complex and deal with ever more data, high quality code is needed to handle the complexities of the analyses reliably and productively. The solution to this is well designed and documented libraries. For example, SeqAn (Reinert et al., 2017) is a C++ library that implements algorithms and data structures commonly used in bioinformatics. Not all programmers are well versed in C++, so for users of widely used and accessible higher level programming languages such as Python, Biopython (Cock et al., 2009) is available as a set of Python modules with implementations of commonly needed algorithms. Here, we present the btllib library as an addition to this ecosystem with the goal of providing highly efficient, scalable, and ergonomic implementations of bioinformatics algorithms and data structures.

“…In contrast, the alignment-free methods for biological sequence classification have proven to be efficient and accurate. In terms of memory utilization, a machine-learning model for sequence classification can be more efficient than an alignment-based method (Chu et al, 2020). Recently, researchers have used a machine-learning method for biological sequence classification but their method was limited for kingdom level classification (Nugent and Adamowicz, 2020).…”

Section: Model Training and Optimization (mentioning)

confidence: 99%

Classifying DNA barcode sequences of four insects belonging to Orthoptera order using tensor network

2022

ANRES

Importance of the work: Orthoptera species are one of the most rapidly increasing groups of insects being used as food and feed. However, identifying edible insects can be difficult due to their small size and the similar morphological features in closely related species. Therefore, classification of insects is often conducted by amplifying their DNA barcode sequence and comparing it with databases containing reference sequences. However, the absence of reference DNA sequences (such as cytochrome c oxidase subunit I (COI)) may confound predictions of the taxonomic community of interest and make it difficult to characterize biodiversity from DNA samples. Objective: To develop a quantum-inspired tensor network-based machine-learning model to categorize COI sequences for four insects belonging to the Orthoptera order. Materials & Methods: For alignment-free classification, each DNA barcode was represented as a tensor product of k-mers encoded in a D-dimensional space, which acts as the feature map and input for a tensor network layer for the classification. The developed model was tested with two different numbers of tensor units as well as different k-mer sizes. Results: The presented model was effective for making accurate predictions for unseen DNA barcodes and can be generalized for any DNA/RNA sequence categorization. The tensor network classifier could assign COI sequences of varying lengths to four different classes with an accuracy greater than 99% and with fewer hyper-parameters.

“…From each gap, we extracted 500 bp flanks from both sides to construct a FASTA file using a combination of in-house scripts, SAMtools (v1.9) [24], and BEDtools (v2.27.1) [25]. Finally, we used the BioBloomMIMaker utility from BioBloom Tools (v2.3.2) [26] to construct a multi-index Bloom filter for each flank. Next, using BioBloomMICategorizer [26] we built a FASTQ file by selecting any read, along with its mate, that mapped to a gap flank sequence.…”

Section: Methods (mentioning)

confidence: 99%

“…Finally, we used the BioBloomMIMaker utility from BioBloom Tools (v2.3.2) [26] to construct a multi-index Bloom filter for each flank. Next, using BioBloomMICategorizer [26] we built a FASTQ file by selecting any read, along with its mate, that mapped to a gap flank sequence. For each gap, this pair of FASTA and FASTQ files was the input used to run GapPredict.…”

Section: Methods (mentioning)

confidence: 99%
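
The flank-extraction step quoted above can be illustrated with a minimal sketch. The cited work used in-house scripts together with SAMtools and BEDtools. The pure-Python version below is a stand-in, not that pipeline. It assumes that gaps appear as runs of `N` in a scaffold sequence, and the names `gap_flanks` and `to_fasta` are hypothetical.

```python
import re

FLANK_LEN = 500  # flank length quoted in the citation statement


def gap_flanks(scaffold_id, seq, flank_len=FLANK_LEN):
    """Yield (record_name, flank_sequence) for both sides of every gap.

    Assumption: a gap is any run of 'N' (or 'n') in the scaffold.
    A flank is shorter than flank_len near a scaffold end, and it may
    contain N characters when two gaps lie closer than flank_len.
    """
    for i, match in enumerate(re.finditer(r"[Nn]+", seq)):
        left = seq[max(0, match.start() - flank_len):match.start()]
        right = seq[match.end():match.end() + flank_len]
        if left:
            yield f"{scaffold_id}_gap{i}_left", left
        if right:
            yield f"{scaffold_id}_gap{i}_right", right


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text, one line per sequence."""
    return "".join(f">{name}\n{sequence}\n" for name, sequence in records)


if __name__ == "__main__":
    # Toy scaffold with one gap; a short flank length keeps the output readable.
    print(to_fasta(gap_flanks("scaffold1", "ACGTACGTNNNTTGCA", flank_len=4)), end="")
```

Each flank would then be indexed and used to recruit reads, as the statement describes. That step relies on BioBloom Tools and is not reproduced here.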

GapPredict – A Language Model for Resolving Gaps in Draft Genome Assemblies

Chen

Chu

Zhang

et al. 2021

IEEE/ACM Trans. Comput. Biol. and Bioinf.

Self Cite

Short-read DNA sequencing instruments can yield over 10^12 bases per run, typically composed of reads 150 bases long. Despite this high throughput, de novo assembly algorithms have difficulty reconstructing contiguous genome sequences using short reads due to both repetitive and difficult-to-sequence regions in these genomes. Some of the short read assembly challenges are mitigated by scaffolding assembled sequences using paired-end reads. However, unresolved sequences in these scaffolds appear as "gaps". Here, we introduce GapPredict, an implementation of a proof of concept that uses a character-level language model to predict unresolved nucleotides in scaffold gaps. We benchmarked GapPredict against the state-of-the-art gap-filling tool Sealer, and observed that the former can fill 65.6% of the sampled gaps that were left unfilled by the latter with high similarity to the reference genome, demonstrating the practical utility of deep learning approaches to the gap-filling problem in genome assembly.
