Comparative analysis using K-mer and K-flank patterns provides evidence for CpG island sequence evolution in mammalian genomes

Chae, Han‐Jung; Park, Jin-Woo; Lee, Seong Whan; Nephew, Kenneth P.; Kim, Sun

doi:10.1093/nar/gkt144

Cited by 24 publications

(17 citation statements)

References 27 publications


“…Since a unique read is assigned a unique quadruplet, the deduplexing can be done efficiently. Nubeam can be effective in some areas where the K-mer approach is useful, for example, characterization of protein binding sequence motif (Newburger and Bulyk, 2009), characterizing CpG island by the flanking regions (Chae et al, 2013), and characterizing sequence feature for haplotype grouping (Navarro-Gomez et al, 2015).…”

Section: Discussion

mentioning

confidence: 99%

Human microbiome sequences in the light of the Nubeam

Dai, Guan 2019

Preprint


We present Nubeam (nucleotide be a matrix) as a novel reference-free approach to analyze short sequencing reads. Nubeam represents nucleotides by matrices, transforms a read into a product of matrices, and based on which assigns numbers to reads. Nubeam capitalizes on the non-commutative property of matrix multiplication, such that different reads are assigned different numbers, and similar reads similar numbers. A sample, which is a collection of reads, becomes a collection of numbers that form an empirical distribution. We demonstrate that the genetic difference between samples can be quantified by the distance between empirical distributions. Nubeam can account for GC bias and nucleotide quality, and is computationally efficient; the K-mer method is a special case of Nubeam, but without those benefits. As a reference-free approach, Nubeam avoids reference bias and mapping bias and can work with organisms without reference genomes. Thus, Nubeam is ideal to analyze datasets from metagenomic whole-genome sequencing, where the amount of unmapped reads is substantial. When applied to human microbiome sequencing, Nubeam recapitulated findings made by mapping-based methods, and shed lights on contributions of unmapped reads. In particular, body habitats dominate clustering of unmapped pseudo-samples; there are more outliers in skin whole samples than the skin mapped pseudo-samples; and analysis of unmapped reads suggested that the sequencing depth is far from sufficient for urogenital samples.

Introduction

When identifying variants is not a must and the primary interest is to quantify genetic differences between samples (Ravel et al., 2011; Nayfach and Pollard, 2016), it can be beneficial to analyze short sequencing reads without reference genomes. First, it avoids reference bias and mapping bias. Both biases can be alleviated but never overcome because they are intrinsic to the mapping based approach. Second, it avoids uncertainty related to variants-call, particularly when the sequencing coverage is low. Third, it becomes possible to analyze organisms that have no reference genomes, or the reference genomes are incomplete or in low quality.

The prominent reference-free approach is the K-mer method (Jiang et al., 2012; Subramanian and Schwartz, 2015; Lu et al., 2017). Simply put, the K-mer method calculates frequencies of each K-mer (K consecutive nucleotides) presented in all reads from a sample, and infer differences between samples by comparing K-mer frequencies. In practice, however, the K-mer method has several difficulties. First, it implicitly assumes error-free in reads, and it is difficult, if not impossible, to account for nucleotide quality (Comin et al., 2015; Comin and Schimd, 2016). Second, choosing K can be a headache: too small or too large of K will make the K-mer frequencies less informative. Third, some pairs of K-mers only differ by one nucleotide and other pairs of K-mers differ by K...
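
The K-mer method summarized in the excerpt above (count every K-mer across a sample's reads, then compare the frequency profiles of two samples) can be illustrated with a minimal Python sketch. The function names, the choice to skip K-mers containing non-ACGT characters, and the use of Euclidean distance are assumptions made for illustration; the excerpt does not say which distance or filtering the cited K-mer studies use.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(reads, k):
    """Return the relative frequency of each K-mer seen across all reads.

    Assumption: K-mers containing anything other than A, C, G, T
    (for example the ambiguity code N) are skipped.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def euclidean_distance(freq_a, freq_b):
    """Euclidean distance between two K-mer frequency profiles.

    A K-mer missing from one profile is treated as having frequency 0.
    """
    keys = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a.get(x, 0.0) - freq_b.get(x, 0.0)) ** 2 for x in keys))


if __name__ == "__main__":
    sample_a = ["ACGTACGT", "ACGTTTGA"]   # toy reads, not real data
    sample_b = ["TTTTGGGG", "ACGTACGA"]
    fa = kmer_frequencies(sample_a, 3)
    fb = kmer_frequencies(sample_b, 3)
    print(euclidean_distance(fa, fb))
```

Note that this sketch treats every base as error-free and weights all K-mer mismatches equally, which are exactly the limitations the excerpt lists.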


“…In this paper, we propose a novel task of variable-length k-mer profiling. While the necessity of diversifying k-mer lengths has already been demonstrated in many studies [6,16,29], most of the existing works only support fixed-length k-mers and need an enormous amount of memory, disk space, and time to profile k-mers with a wide range of k's. By leveraging the techniques of binarization and rolling hash for Aho-Corasick automaton, we construct a thinned Aho-Corasick automaton accelerated by rolling hash (TahcoRoll) to profile variable-length k-mers in genomic data.…”

Section: Discussion

mentioning

confidence: 99%

“…The best k to characterize different genomic regions can vary. Chae et al [6] have shown that it is necessary to consider patterns of 3- to 10-mers to construct the phylogenetic tree. Rahman et al [29] have proposed to merge the differential occurred k-mers to form longer and variable-length sequences for downstream analysis.…”

mentioning

confidence: 99%

TahcoRoll: An Efficient Approach for Signature Profiling in Genomic Data through Variable-Length k-mers

Jiang et al. 2017

Preprint


Abstract. K-mer profiling has been one of the trending approaches to analyze read data generated by high-throughput sequencing technologies. The tasks of k-mer profiling include, but are not limited to, counting the frequencies and determining the occurrences of short sequences in a dataset. The notion of k-mer has been extensively used to build de Bruijn graphs in genome or transcriptome assembly, which requires examining all possible k-mers presented in the dataset. Recently, an alternative way of profiling has been proposed, which constructs a set of representative k-mers as genomic markers and profiles their occurrences in the sequencing data. This technique has been applied in both transcript quantification through RNA-Seq and taxonomic classification of metagenomic reads. Most of these applications use a set of fixed-size k-mers since the majority of existing k-mer counters are inadequate to process genomic sequences with variable-length k-mers. However, choosing the appropriate k is challenging, as it varies for different applications. As a pioneer work to profile a set of variable-length k-mers, we propose TahcoRoll in order to enhance the Aho-Corasick algorithm. More specifically, we use one bit to represent each nucleotide, and integrate the rolling hash technique to construct an efficient in-memory data structure for this task. Using both synthetic and real datasets, results show that TahcoRoll outperforms existing approaches in either or both time and memory efficiency without using any disk space. In addition, compared to the most efficient state-of-the-art k-mer counters, such as KMC and MSBWT, TahcoRoll is the only approach that can process long read data from both PacBio and Oxford Nanopore on a commodity desktop computer. The source code of TahcoRoll is implemented in C++14, and available at https://github.com/chelseaju/TahcoRoll.git.
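
The task described in this abstract, counting occurrences of a given set of k-mers whose lengths differ, can be illustrated with a plain rolling hash. The Python sketch below is not TahcoRoll: it has no Aho-Corasick automaton and no one-bit nucleotide encoding, and it simply makes one pass over the read per distinct k-mer length. All names (`profile_kmers`, `poly_hash`, `BASE`, `MOD`) are hypothetical, and the read is assumed to contain only A, C, G and T.

```python
from collections import defaultdict

BASE = 4                      # one digit per nucleotide
MOD = (1 << 61) - 1           # large prime modulus for the polynomial hash
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def poly_hash(s):
    """Polynomial hash of a nucleotide string (assumes only A/C/G/T)."""
    h = 0
    for ch in s:
        h = (h * BASE + CODE[ch]) % MOD
    return h


def profile_kmers(read, targets):
    """Count occurrences of each target k-mer in `read`.

    Targets may have different lengths; each distinct length k gets one
    pass over the read, with the window hash updated in O(1) per step.
    """
    by_len = defaultdict(dict)            # k -> {hash: [target k-mers]}
    for t in set(targets):
        by_len[len(t)].setdefault(poly_hash(t), []).append(t)

    counts = {t: 0 for t in targets}
    for k, table in by_len.items():
        if k == 0 or k > len(read):
            continue
        top = pow(BASE, k - 1, MOD)       # weight of the leading nucleotide
        h = poly_hash(read[:k])
        for i in range(len(read) - k + 1):
            if i > 0:
                # Drop read[i-1], shift the window, append read[i+k-1].
                h = ((h - CODE[read[i - 1]] * top) * BASE
                     + CODE[read[i + k - 1]]) % MOD
            for t in table.get(h, ()):
                if read[i:i + k] == t:    # verify to rule out hash collisions
                    counts[t] += 1
    return counts


if __name__ == "__main__":
    print(profile_kmers("ACGTACGT", ["ACG", "GT", "TTTT", "ACGTACGT"]))
```

The cost of this naive version grows with the number of distinct lengths in the target set, which is the kind of overhead the abstract says motivates combining the rolling hash with an automaton.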


“…This work was motivated by our previous works in modeling DNA methylation susceptibility [26–28] and conservation of CpG island sequences [29]. We and many scientists believe that DNA methylation is not random and probably there is an instructive mechanism embedded in the genomic sequences [30].…”

Section: Motivation

mentioning

confidence: 99%

Subtype-specific CpG island shore methylation and mutation patterns in 30 breast cancer cell lines

et al. 2016


Background: Aberrant epigenetic modifications, including DNA methylation, are key regulators of gene activity in tumorigenesis. Breast cancer is a heterogeneous disease, and large-scale analyses indicate that tumor from normal and benign tissues, as well as molecular subtypes of breast cancer, can be distinguished based on their distinct genomic, transcriptomic, and epigenomic profiles. In this study, we used affinity-based methylation sequencing data in 30 breast cancer cell lines representing functionally distinct cancer subtypes to investigate methylation and mutation patterns at the whole genome level.

Results: Our analysis revealed significant differences in CpG island (CpGI) shore methylation and mutation patterns among breast cancer subtypes. In particular, the basal-like B type, a highly aggressive form of the disease, displayed distinct CpGI shore hypomethylation patterns that were significantly associated with downstream gene regulation. We determined that mutation rates at CpG sites were highly correlated with DNA methylation status and observed distinct mutation rates among the breast cancer subtypes. These findings were validated by using targeted bisulfite sequencing of differentially expressed genes (n=85) among the cell lines.

Conclusions: Our results suggest that alterations in DNA methylation play critical roles in gene regulatory process as well as cytosine substitution rates at CpG sites in molecular subtypes of breast cancer.

Electronic supplementary material: The online version of this article (doi:10.1186/s12918-016-0356-2) contains supplementary material, which is available to authorized users.

