kmcEx: memory-frugal and retrieval-efficient encoding of counted <i>k</i>-mers

Jiang, Peng; Luo, Jie; Wang, Yiqi; Deng, Ping-Ji; Schmidt, Bertil; Tang, Xiangjun; Chen, Ningjiang; Wong, Limsoon; Zhao, Liang

doi:10.1093/bioinformatics/btz299

Cited by 5 publications

(3 citation statements)

References 22 publications


“…For instance, the 31-mers having a count larger than one of the HapMap sample NA12878 take 90-Gb space on disk. To solve this problem, we have designed a novel coupled Bloom Filter-based algorithm achieving high memory saving ratio and good retrieval efficiency (Jiang et al, 2019). Let f_max be the maximum frequency in K, which can be represented by at most h bits (in binary).…”

Section: Methods (mentioning)

confidence: 99%
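
The relationship between the maximum k-mer count and the number of bits h mentioned in the statement above can be illustrated with a minimal Python sketch. This is not the kmcEx implementation or the coupled Bloom Filter itself; the function names `count_kmers` and `bits_for_max_count` and the toy sequence are hypothetical, chosen only for illustration.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every length-k substring of seq (forward strand only)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def bits_for_max_count(counts):
    """Return (f_max, h): the largest count and the number of bits
    needed to write that count in binary."""
    f_max = max(counts.values())
    return f_max, f_max.bit_length()


# Toy input (an assumption for illustration, not real sequencing data).
counts = count_kmers("ACGTACGTT", 3)

# Keep only k-mers seen more than once, mirroring the "count larger than one" filter.
repeated = {kmer: c for kmer, c in counts.items() if c > 1}

f_max, h = bits_for_max_count(repeated)
print(repeated, f_max, h)
```

In this toy case only ACG and CGT occur more than once, so f_max is 2 and two bits are enough to store any retained count.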


Efficient Mining of Variants From Trios for Ventricular Septal Defect Association Study

Jiang, Wang, et al. 2019, Front. Genet.

Self Cite


Ventricular septal defect (VSD) is a fatal congenital heart disease with severe consequences for affected infants. Early diagnosis plays an important role, particularly through genetic variants. Existing panel-based approaches to variant mining suffer from a shortage of large panels, costly sequencing, and missed rare variants. Although a trio-based method alleviates these limitations to some extent, it is agnostic to novel mutations and computationally intensive. Considering these limitations, we study a novel algorithm for mining variants from trio-based sequencing data and apply it to a VSD trio to identify associated mutations. Our approach starts by filtering irrelevant k-mers from the sequences of a trio via a newly conceived coupled Bloom Filter, then corrects sequencing errors using a statistical approach and extends the retained k-mers into long sequences. These extended sequences are then used as input to obtain the needed variants. Later, the obtained variants are comprehensively analyzed against existing databases to mine VSD-related mutations. Experiments show that our trio-based algorithm narrows down candidate coding genes and lncRNAs by about 10-fold and 5-fold, respectively, compared with single sequence-based approaches. Meanwhile, our algorithm is 10 times faster and two orders of magnitude more memory-frugal than the existing state-of-the-art approach. By applying our approach to a VSD trio, we identify an unreported gene (CD80), a combination of two genes (MYBPC3 and TRDN), and a lncRNA (NONHSAT096266.2) that are highly likely to be VSD-related.



“…Based on the above steps, K_f, K_m, and K_c can be saved into B_f, B_m, and B_c economically; more details are shown in Jiang et al (2019).…”

Section: Methods (mentioning)

confidence: 99%

Efficient Mining of Variants From Trios for Ventricular Septal Defect Association Study

Jiang, Wang, et al. 2019, Front. Genet.

Self Cite


“…Counting the frequencies of k-mers is an algorithm that is widely used in many areas of genomics (Xiao et al, 2018); from genome assembly and error detection to sequence alignment and variant calling (Kelley et al, 2010; Li et al, 2010). Others (Marçais and Kingsford, 2011; Rizk et al, 2013; Audano and Vannberg, 2014; Deorowicz et al, 2015; Li and Yan, 2015; Jiang et al, 2019) have explored ways to optimize k-mer counting with reduced memory and storage. While these k-mer counting algorithms process a single sample, SMUFIN processes k-mer counters of normal and tumoral samples of the same patient together, potentially making the memory footprint even bigger.…”

Section: Introduction (mentioning)

confidence: 99%

Enabling Genomics Pipelines in Commodity Personal Computers With Flash Storage

et al. 2021


Analysis of a patient's genomics data is the first step toward precision medicine. Such analyses are performed on expensive enterprise-class server machines because input data sets are large, and the intermediate data structures are even larger (TB-size) and require random accesses. We present a general method to solve a specific genomics problem, mutation detection, on a cheap commodity personal computer (PC) with a small amount of DRAM. We construct and access large histograms of k-mers efficiently on external storage (SSDs) and apply our technique to a state-of-the-art reference-free genomics algorithm, SMUFIN, to create SMUFIN-F. We show that on two PCs, SMUFIN-F can achieve the same throughput at only one third (36%) the hardware cost and half (45%) the energy compared to SMUFIN on an enterprise-class server. To the best of our knowledge, SMUFIN-F is the first reference-free system that can detect somatic mutations on commodity PCs for whole human genomes. We believe our technique should apply to other k-mer or n-gram-based algorithms.
