Association mapping from sequencing reads using k-mers

Rahman, Atif; Hallgrímsdóttir, Ingileif B.; Eisen, Michael B.; Pachter, Lior

doi:10.7554/elife.32920

Cited by 100 publications

(138 citation statements)

References 50 publications

Supporting

Mentioning

134

Contrasting


“…This approach, not centered around one specific reference genome, can identify biochemical pathways associated with, for example, pathogenicity. This approach has also been applied in humans, where the number of unique k-mers is much higher than in bacterial strains, due to their larger genome (Rahman et al, 2018). However, this was restricted to case-control situations, and due to high computational load, population structure was corrected only for a subset of k-mers.…”

Section: Introduction (mentioning)

confidence: 99%

Finding genetic variants in plants without complete genomes

Voichek, Weigel

2019

Preprint


Structural variants and presence/absence polymorphisms are common in plant genomes, yet they are routinely overlooked in genome-wide association studies (GWAS). Here, we expand the genetic variants detected in GWAS to include major deletions, insertions, and rearrangements. We first use raw sequencing data directly to derive short sequences, k-mers, that mark a broad range of polymorphisms independently of a reference genome. We then link k-mers associated with phenotypes to specific genomic regions. Using this approach, we re-analyzed 2,000 traits measured in Arabidopsis thaliana, tomato, and maize populations. Associations identified with k-mers recapitulate those found with single-nucleotide polymorphisms (SNPs), but with stronger statistical support. Moreover, we identified new associations with structural variants and with regions missing from reference genomes.

Our results demonstrate the power of performing GWAS before linking sequence reads to specific genomic regions, which allows detection of a wider range of genetic variants responsible for phenotypic variation.

Here, we present an efficient method for k-mer-based GWAS and compare it directly to the conventional SNP-based approach on more than 2,000 phenotypes from three plant species with different genome and population characteristics: A. thaliana, maize, and tomato. Most variants identified by SNPs can be detected with k-mers (and vice versa), but k-mers have stronger statistical support.

For k-mer-only hits, we demonstrate how different strategies can be used to infer their genomic context, including large structural variants, sequences missing from the reference genome, and organellar variants. Lastly, we compute population structure directly from k-mers, enabling the analysis of species with poor quality or without a reference genome.

In summary, we have inverted the conventional approach of building a genome, using it to find population variants, and only then associating variants with phenotypes. In contrast, we begin by associating sequencing reads with phenotypes, and only then infer the genomic context of these sequences. We posit that this change of order is especially effective in plant species, for which defining the full population-level genetic variation based on reference genomes remains highly challenging.

[…] Schneeberger et al., 2009). While traditional GWAS methods will benefit from these technological improvements, so will k-mer based approaches, which will be able to use tags spanning larger genomic distances. Therefore, we posit that for GWAS purposes, k-mer based approaches are ideal because they minimize arbitrary choices when classifying alleles and because they capture more, almost optimal, information from raw sequencing data.
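The reference-free idea in the abstract above (derive k-mers from raw reads first, associate them with a phenotype, and only afterwards look for genomic context) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names, the toy reads, and the scoring rule (mean phenotype of carriers minus non-carriers) are hypothetical and are not the authors' pipeline, which per the abstract also accounts for population structure.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(read, k):
    """Return the set of canonical k-mers in a read.

    A k-mer and its reverse complement are merged into one key
    (the lexicographically smaller of the two).
    """
    found = set()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing N etc.
            reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
            found.add(min(kmer, reverse_complement))
    return found


def presence_table(reads_by_sample, k):
    """Map each canonical k-mer to the set of samples whose reads contain it."""
    table = defaultdict(set)
    for sample, reads in reads_by_sample.items():
        for read in reads:
            for kmer in canonical_kmers(read, k):
                table[kmer].add(sample)
    return table


def mean_difference(table, phenotype):
    """Score each variable k-mer: mean phenotype of carriers minus non-carriers.

    Hypothetical toy statistic; no correction for population structure.
    """
    samples = set(phenotype)
    scores = {}
    for kmer, carriers in table.items():
        carriers = carriers & samples
        absent = samples - carriers
        if not carriers or not absent:
            continue  # present in all or no samples: no signal
        mean_in = sum(phenotype[s] for s in carriers) / len(carriers)
        mean_out = sum(phenotype[s] for s in absent) / len(absent)
        scores[kmer] = mean_in - mean_out
    return scores


# Toy data (invented): two samples carry one allele, two carry another.
reads_by_sample = {
    "s1": ["ACGTTGCA"],
    "s2": ["ACGTTGCA"],
    "s3": ["ACGATGCA"],
    "s4": ["ACGATGCA"],
}
phenotype = {"s1": 1.0, "s2": 1.2, "s3": 3.0, "s4": 3.2}

table = presence_table(reads_by_sample, k=5)
scores = mean_difference(table, phenotype)
for kmer in sorted(scores, key=scores.get):
    print(kmer, round(scores[kmer], 2))
```

No reference genome appears anywhere in the sketch: the k-mers that distinguish the two groups of samples are found and scored directly from reads, and locating them in a genome would be a separate, later step.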



“…[26][27][28] Note that several alternative k-mer counting libraries and tools [29,30] have been developed to solve a variety of different biological problems. [31][32][33][34][35] Step 1: Identifying novel k-mers and reads. To identify sequences spanning de novo variants, Kevlar scans each read sequenced from the proband. The per-sample abundances of each k-mer are queried from the Count-Min sketches computed in previous steps.…”

Section: Kevlar Workflow (mentioning)

confidence: 99%

Kevlar: a mapping-free framework for accurate discovery of de novo variants

Standage, Brown, Hormozdiari

2019

Preprint


Motivation: Discovery of genetic variants by whole genome sequencing has proven a powerful approach to study the etiology of complex genetic disorders. Elucidation of all variants is a necessary step in identifying causative variants and disease genes. In particular, there is an increased interest in detection of de novo variation and investigation of its role in various disorders. State-of-the-art methods for variant discovery rely on mapping reads from each individual to a reference genome and predicting variants from differences observed between the mapped reads and the reference genome. This process typically results in millions of variant predictions, most of which are inherited and irrelevant to the phenotype of interest. To distinguish between inherited variation and novel variation resulting from de novo germline mutation, whole-genome sequencing of close relatives (especially parents and siblings) is commonly used. However, standard mapping-based approaches tend to have a high false-discovery rate for de novo variant prediction, which in many cases arises from problems with read mapping. This is a particular challenge in predicting de novo indels and structural variants.

Results: We have developed a mapping-free method, Kevlar, for de novo variant discovery based on direct comparison of sequence content between related individuals. Kevlar identifies high-abundance k-mers unique to the individual of interest and retrieves the reads containing these k-mers. These reads are easily partitioned into disjoint sets by shared k-mer content for subsequent locus-by-locus processing and variant calling. Kevlar also utilizes a novel probabilistic approach to score and rank the variant predictions to identify the most likely de novo variants. We evaluated Kevlar on simulated and real pedigrees, and demonstrate its ability to detect both de novo SNVs and indels with high sensitivity and specificity.

Availability: https://github.com/dib-lab/kevlar
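The step quoted in the citation statement above (query per-sample k-mer abundances from Count-Min sketches, then keep proband reads containing k-mers that are abundant in the proband and absent from the parents) can be sketched as follows. This is a minimal illustration under stated assumptions, not Kevlar's code: the `CountMinSketch` class, the `novel_reads` helper, the sketch dimensions, and the thresholds `min_abund` and `max_parent` are all hypothetical, and k-mers are not canonicalized, to keep the example short.

```python
import hashlib


class CountMinSketch:
    """Tiny Count-Min sketch: estimates never undercount, but may overcount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One salted hash per row gives `depth` different hash functions.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item):
        for row, col in self._slots(item):
            self.rows[row][col] += 1

    def count(self, item):
        # The minimum over rows is the least collision-inflated estimate.
        return min(self.rows[row][col] for row, col in self._slots(item))


def kmers(read, k):
    """All k-mers of a read, in order (no canonicalization in this sketch)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def sketch_reads(reads, k):
    """Count every k-mer of every read into a fresh sketch."""
    sketch = CountMinSketch()
    for read in reads:
        for kmer in kmers(read, k):
            sketch.add(kmer)
    return sketch


def novel_reads(proband_reads, proband, parents, k, min_abund=2, max_parent=0):
    """Return (read, novel k-mers) for reads holding proband-specific k-mers.

    Thresholds are hypothetical illustration values.
    """
    hits = []
    for read in proband_reads:
        novel = [
            kmer for kmer in kmers(read, k)
            if proband.count(kmer) >= min_abund
            and all(parent.count(kmer) <= max_parent for parent in parents)
        ]
        if novel:
            hits.append((read, novel))
    return hits


# Toy trio (invented): the proband carries one A>T change absent from both parents.
k = 5
parent_reads = ["ACGTACGTAC"] * 2
proband_reads = ["ACGTACGTAC", "ACGTTCGTAC", "ACGTTCGTAC"]

parents = [sketch_reads(parent_reads, k), sketch_reads(parent_reads, k)]
proband = sketch_reads(proband_reads, k)

hits = novel_reads(proband_reads, proband, parents, k)
for read, novel in hits:
    print(read, sorted(set(novel)))
```

Because a Count-Min sketch can only overestimate, hash collisions in a parent sketch can hide a truly novel k-mer but cannot make an inherited k-mer look novel, so in this sketch collisions cost sensitivity rather than adding false candidates.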


“…To calculate the scores using the equations mentioned above, the prior probability distribution on numbers of copies of k-mers in the genome and conditional probability distributions on k-mer counts in the reads given the copy numbers in the genome need to be defined. When a k-mer appears in the read set due to the presence of one or more copies of the sequence in the genome, Poisson distributions have been observed to model the counts well in genome sequencing data [24]. If a genomic region is present i times, then the counts of the k-mers within that region are assumed to be Poisson distributed with mean λi, where λ is the k-mer coverage of the dataset.…”

Section: Learning Probability Distributions and Estimating Priors (mentioning)

confidence: 99%

kRISP-meR: A Reference-free Guide-RNA Design Tool for CRISPR/Cas9

Hera, Rahman

2019

Preprint

Self Cite


Genome editing using the CRISPR/Cas9 system requires designing guide RNAs (sgRNAs) that are efficient and specific. Guide RNAs are usually designed using reference genomes, which limits their use in organisms with no or incomplete reference genomes. Here, we present kRISP-meR, a reference-free method to design sgRNAs for the CRISPR/Cas9 system. kRISP-meR takes as input a target region and sequenced reads from the organism to be edited and generates sgRNAs that are likely to minimize off-target effects. Our analysis indicates that kRISP-meR is able to identify the majority of the guides identified by a widely used sgRNA design tool, without any knowledge of the reference, while retaining specificity.
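The model quoted in the citation statement above (k-mer counts in a region present i times are Poisson distributed with mean λi, combined with a prior on copy number) can be turned into a small worked example. The sketch below is only an illustration of that statement: the function names, the prior values, and the coverage λ = 20 are assumptions, and only copy numbers i ≥ 1 are considered, since the quoted passage addresses k-mers that are present in the genome.

```python
import math


def poisson_pmf(count, mean):
    """P(X = count) for X ~ Poisson(mean), computed in log space for stability."""
    return math.exp(-mean + count * math.log(mean) - math.lgamma(count + 1))


def copy_number_posterior(count, coverage, prior):
    """Posterior over copy number i given an observed k-mer count.

    Assumes count | i ~ Poisson(coverage * i), with `coverage` playing the
    role of lambda in the quoted text. `prior` maps i >= 1 to P(i).
    """
    weights = {
        i: p * poisson_pmf(count, coverage * i) for i, p in prior.items()
    }
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


# Hypothetical numbers for illustration only.
coverage = 20.0                      # k-mer coverage (lambda)
prior = {1: 0.90, 2: 0.08, 3: 0.02}  # invented prior on copy number

for observed in (20, 41):
    posterior = copy_number_posterior(observed, coverage, prior)
    best = max(posterior, key=posterior.get)
    print(observed, best, {i: round(p, 4) for i, p in posterior.items()})
```

With these invented numbers, a count near λ favors one copy, while a count near 2λ favors two copies even though the prior puts most of its mass on a single copy.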

