De novo mutations (DNM) are an important source of rare variants and are increasingly being linked to the development of many diseases. Recently, the paternal age effect has been the focus of several studies attempting to explain the observation that the risk of a number of diseases increases with paternal age. Using disease-free familial quartets, we show a strong positive correlation between paternal age and the number of germline DNM in healthy subjects. We also observed that germline CNVs do not follow the same trend, suggesting a different mechanism. Finally, we observed that DNM were not evenly distributed across the genome, which adds support to the existence of DNM hotspots.
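A minimal sketch of the kind of analysis summarized above: testing the relationship between paternal age and the per-offspring germline DNM count. This is illustrative only, not the authors' pipeline; the input file and column names are assumptions.

import pandas as pd
from scipy import stats

# Hypothetical per-offspring table: paternal age at conception and germline DNM count
trios = pd.read_csv("dnm_counts.csv")

# Pearson correlation between paternal age and DNM count
r, p_value = stats.pearsonr(trios["paternal_age"], trios["dnm_count"])

# Simple linear fit: additional DNMs per extra year of paternal age
fit = stats.linregress(trios["paternal_age"], trios["dnm_count"])
print(f"r={r:.2f} (p={p_value:.2g}); ~{fit.slope:.2f} additional DNMs per year")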
Background - The advent of high throughput sequencing methods brings a number of technical challenges. Among these is the discovery of copy-number variations (CNVs) using whole-genome sequencing data. CNVs are genomic structural variations defined as a change in the number of copies of a large genomic fragment, usually more than one kilobase. Here, we compare different CNV calling methods in order to assess their ability to consistently identify CNVs, based on the comparison of calls in 9 quartets each including a monozygotic twin pair. The use of monozygotic twins provides a means of estimating the error rate of each algorithm by identifying CNVs that are inconsistently called under the rules of Mendelian inheritance and the assumption of identical genomes between twins. The similarity between the calls from the different tools and the advantage of combining call sets were also considered. Results - ERDS and CNVnator obtained the best performance when considering the inherited CNV rate, with means of 0.74 and 0.70, respectively. Venn diagrams were generated to show the agreement between the different algorithms, before and after filtering out familial inconsistencies. This filtering revealed a high number of false positives for CNVer and Breakdancer. A low overall agreement between the methods suggested a high complementarity of the different tools when calling CNVs. The breakpoint sensitivity analysis indicated that CNVnator and ERDS achieved better resolution of CNV borders than the other tools. The highest inherited CNV rate was achieved by intersecting the calls of these two tools (81%). Conclusions - This study showed that ERDS and CNVnator perform well on whole genome sequencing data with respect to CNV consistency across families, CNV breakpoint resolution and CNV call specificity. The intersection of the calls from the two tools would be valuable for CNV genotyping pipelines.
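A sketch of the type of comparison described above: intersecting CNV calls from two tools (e.g. ERDS and CNVnator) with a reciprocal-overlap rule, and computing an inherited CNV rate as the fraction of a child's calls also found in a parent. The 50% reciprocal-overlap threshold and the data structures are assumptions, not the study's exact criteria.

def reciprocal_overlap(a, b, min_frac=0.5):
    """a, b: (chrom, start, end). True if the two calls overlap reciprocally by min_frac."""
    if a[0] != b[0]:
        return False
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return False
    return overlap / (a[2] - a[1]) >= min_frac and overlap / (b[2] - b[1]) >= min_frac

def intersect_calls(calls_tool1, calls_tool2):
    """Keep only calls from tool 1 that are supported by at least one call from tool 2."""
    return [c for c in calls_tool1 if any(reciprocal_overlap(c, d) for d in calls_tool2)]

def inherited_rate(child_calls, parent_calls):
    """Fraction of the child's CNV calls found in a parent (Mendelian consistency)."""
    if not child_calls:
        return 0.0
    supported = sum(any(reciprocal_overlap(c, p) for p in parent_calls) for c in child_calls)
    return supported / len(child_calls)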
Background - Along with the improvement of high throughput sequencing technologies, the genetics community is showing marked interest in the rare variants/common diseases hypothesis. While sequencing can still be prohibitively expensive for large studies, commercially available genotyping arrays targeting rare variants are a reasonable alternative. A technical challenge of array-based methods is the task of deriving genotype classes (homozygous or heterozygous) by clustering intensity data points. The performance of clustering tools for common polymorphisms is well established, but their performance with a large proportion of rare variants (where data points are sparse for genotypes containing the rare allele) is less well known. We compared the performance of four clustering tools (GenCall, GenoSNP, optiCall and zCall) for the genotyping of over 10,000 samples using Illumina's HumanExome BeadChip, which includes 247,870 variants, 90% of which have a minor allele frequency below 5% in a population of European ancestry. Different reference parameters for GenCall and different initial parameters for GenoSNP were tested. Genotyping accuracy was assessed using data from the 1000 Genomes Project as a gold standard, and agreement between tools was measured. Results - Concordance of GenoSNP's calls with the gold standard was below expectations and was increased by changing the tool's initial parameters. While the four tools provided concordance with the gold standard above 99% for common alleles, some of them performed poorly for rare alleles. The reproducibility of genotype calls for each tool was assessed using experimental duplicates, which provided concordance rates above 99%. The inter-tool agreement of genotype calls was high for approximately 95% of variants. Most tools yielded similar error rates (approximately 0.02), except for zCall, which performed better with a mean error rate of 0.00164. Conclusions - The GenoSNP clustering tool could not be run straight “out of the box” with the HumanExome BeadChip, as modification of hard-coded parameters was necessary to achieve optimal performance. Overall, GenCall marginally outperformed the other tools for the HumanExome BeadChip. The use of experimental replicates provided a valuable quality control tool for genotyping projects with rare variants.
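An illustrative sketch of the concordance metrics discussed above: comparing a clustering tool's genotype calls against gold-standard genotypes (e.g. from the 1000 Genomes Project) or against an experimental duplicate. The inputs are hypothetical dictionaries mapping (sample, variant) to a genotype in {0, 1, 2}, with None for a no-call.

def concordance(calls_a, calls_b):
    """Proportion of shared, non-missing calls that agree between two call sets."""
    shared = [k for k in calls_a
              if k in calls_b and calls_a[k] is not None and calls_b[k] is not None]
    if not shared:
        return float("nan")
    return sum(calls_a[k] == calls_b[k] for k in shared) / len(shared)

# Usage: overall concordance, or restricted to rare variants (hypothetical 'maf' lookup)
# rare = {k: v for k, v in tool_calls.items() if maf[k[1]] < 0.05}
# print(concordance(rare, gold_standard))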
Summary: Genotype imputation is now commonly performed following genome-wide genotyping experiments. Imputation increases the density of analyzed genotypes in the dataset, enabling fine-mapping across the genome. However, the process of imputation using the most recent publicly available reference datasets can require considerable computation power and the management of hundreds of large intermediate files. We have developed genipe, a complete genome-wide imputation pipeline which includes automatic reporting, imputed data indexing and management, and a suite of statistical tests for imputed data commonly used in genetic epidemiology (Sequence Kernel Association Test, Cox proportional hazards for survival analysis, and linear mixed models for repeated measurements in longitudinal studies).
Availability and Implementation: The genipe package is an open source Python software and is freely available for non-commercial use (CC BY-NC 4.0) at https://github.com/pgxcentre/genipe. Documentation and tutorials are available at http://pgxcentre.github.io/genipe.
Contact: louis-philippe.lemieux.perreault@statgen.org or marie-pierre.dube@statgen.org
Supplementary information: Supplementary data are available at Bioinformatics online.
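A sketch of one of the analyses genipe automates: a Cox proportional hazards test of an imputed allelic dosage against a time-to-event outcome. This uses the lifelines package directly and is not genipe's own API; the input file and column names are assumptions.

import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical table with survival time, event indicator, imputed dosage and covariates
data = pd.read_csv("phenotypes_with_dosage.csv")

cph = CoxPHFitter()
cph.fit(data[["time", "event", "dosage", "age", "sex"]],
        duration_col="time", event_col="event")
print(cph.summary.loc["dosage", ["coef", "exp(coef)", "p"]])  # log-HR, HR and p-value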
Summary: Genetic association studies making use of high-throughput genotyping arrays need to process large amounts of data, in the order of millions of markers per experiment. The first step of any analysis with genotyping arrays is typically a thorough data clean-up and quality control to remove poor quality genotypes and to generate metrics that inform the selection of individuals for downstream statistical analysis. We have developed pyGenClean, a bioinformatics tool to facilitate and standardize the genetic data clean-up pipeline with genotyping array data. In conjunction with a batch-queuing system, the tool minimizes data manipulation errors, accelerates the completion of the data clean-up process and provides informative plots and metrics to guide decision making for statistical analysis.
Availability and implementation: pyGenClean is an open source Python 2.7 software and is freely available, along with documentation and examples, from http://www.statgen.org.
Contact: louis-philippe.lemieux.perreault@umontreal.ca or marie-pierre.dube@statgen.org
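A minimal sketch of a typical early step in the kind of clean-up pipeline described above: flagging samples with a low genotyping call rate. This is a generic illustration, not pyGenClean's implementation; the genotype matrix layout and the 98% threshold are assumptions.

import numpy as np

def samples_failing_call_rate(genotypes, threshold=0.98):
    """genotypes: samples x markers array with np.nan for missing calls.
    Returns the indices of samples whose call rate falls below the threshold."""
    call_rate = 1.0 - np.mean(np.isnan(genotypes), axis=1)
    return np.where(call_rate < threshold)[0]

# Usage: drop flagged samples before downstream association analysis
# bad = samples_failing_call_rate(geno_matrix)
# clean_matrix = np.delete(geno_matrix, bad, axis=0)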
Learning tasks such as those involving genomic data often poses a serious challenge: the number of input features can be orders of magnitude larger than the number of training examples, making it difficult to avoid overfitting, even when using the known regularization techniques. We focus here on tasks in which the input is a description of the genetic variation specific to a patient, the single nucleotide polymorphisms (SNPs), yielding millions of ternary inputs. Improving the ability of deep learning to handle such datasets could have an important impact in medical research, more specifically in precision medicine, where highdimensional data regarding a particular patient is used to make predictions of interest. Even though the amount of data for such tasks is increasing, this mismatch between the number of examples and the number of inputs remains a concern. Naive implementations of classifier neural networks involve a huge number of free parameters in their first layer (number of input features times number of hidden units): each input feature is associated with as many parameters as there are hidden units. We propose a novel neural network parametrization which considerably reduces the number of free parameters. It is based on the idea that we can first learn or provide a distributed representation for each input feature (e.g. for each position in the genome where variations are observed in data), and then learn (with another neural network called the parameter prediction network) how to map a feature's distributed representation (based on the feature's identity not its value) to the vector of parameters specific to that feature in the classifier neural network (the weights which link the value of the feature to each of the hidden units). This approach views the problem of producing the parameters associated with each feature as a multi-task learning problem. We show experimentally on a population stratification task of interest to medical studies that the proposed approach can significantly reduce both the number of parameters and the error rate of the classifier.
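A sketch of the parametrization described above, under assumed dimensions: instead of learning a full (number of SNPs x number of hidden units) first-layer weight matrix, a small parameter prediction network maps a per-SNP embedding (based on the SNP's identity, not its value) to that SNP's vector of first-layer weights. Written with PyTorch; the learned embedding (rather than a precomputed one) and the layer sizes are assumptions for illustration.

import torch
import torch.nn as nn

class FactorizedFirstLayerNet(nn.Module):
    def __init__(self, n_snps, emb_dim=32, n_hidden=100, n_classes=26):
        super().__init__()
        # Distributed representation of each input feature (one embedding per SNP)
        self.snp_embedding = nn.Parameter(torch.randn(n_snps, emb_dim) * 0.01)
        # Parameter prediction network: embedding -> per-SNP vector of n_hidden weights
        self.param_predictor = nn.Sequential(
            nn.Linear(emb_dim, n_hidden), nn.Tanh(), nn.Linear(n_hidden, n_hidden))
        self.classifier_head = nn.Sequential(nn.Tanh(), nn.Linear(n_hidden, n_classes))

    def forward(self, x):  # x: (batch, n_snps) ternary SNP values
        W = self.param_predictor(self.snp_embedding)  # (n_snps, n_hidden) predicted weights
        hidden = x @ W  # first layer applied with predicted, not free, parameters
        return self.classifier_head(hidden)

# Free parameters scale as n_snps*emb_dim plus the small predictor network,
# instead of n_snps*n_hidden for a naive first layer.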
Background - The randomized, placebo-controlled COLchicine Cardiovascular Outcomes Trial (COLCOT) has shown the benefit of colchicine 0.5 mg daily in lowering the rate of ischemic cardiovascular events in patients with a recent myocardial infarction. Here, we conducted a post-hoc pharmacogenomic study of COLCOT with the aim of identifying genetic predictors of the efficacy and safety of treatment with colchicine. Methods - There were 1522 participants of European descent from the COLCOT trial available for the pharmacogenomic study. The primary cardiovascular (CV) endpoint was defined as in the main trial: time to first occurrence of CV death, resuscitated cardiac arrest, myocardial infarction, stroke or urgent hospitalization for angina requiring coronary revascularization. The safety endpoint was time to the first report of gastrointestinal events. Patients' DNA was genotyped using the Illumina Global Screening array followed by imputation. We performed a genome-wide association study (GWAS) in colchicine-treated patients. Results - None of the genetic variants passed the GWAS significance threshold for the primary CV endpoint, analysed in the 702 patients in the colchicine arm who were compliant with medication. The GWAS for gastrointestinal events was conducted in all 767 patients in the colchicine arm and found two significant association signals: one with lead variant rs6916345 (hazard ratio (HR)=1.89, 95% confidence interval (CI) 1.52-2.35, P=7.41×10⁻⁹) in a locus which colocalizes with Crohn's disease, and one with lead variant rs74795203 (HR=2.51, 95% CI 1.82-3.47, P=2.70×10⁻⁸), an intronic variant in the gene SEPHS1. The interaction terms between the genetic variants and treatment with colchicine versus placebo were significant. Conclusions - We found two genomic regions associated with gastrointestinal events in patients treated with colchicine. These findings will benefit from replication to confirm that some patients may have genetic predispositions to lower tolerability of treatment with colchicine.
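An illustrative sketch, not the study's actual pipeline, of the interaction test mentioned above: a Cox model including a genotype-by-treatment term, so the genetic effect on gastrointestinal events can be compared between the colchicine and placebo arms. The input file and column names are assumptions.

import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical table: time to event, event indicator, allelic dosage (0-2), treatment arm (0/1)
df = pd.read_csv("colcot_pgx.csv")
df["dosage_x_colchicine"] = df["dosage"] * df["colchicine"]

cph = CoxPHFitter()
cph.fit(df[["time", "event", "dosage", "colchicine", "dosage_x_colchicine"]],
        duration_col="time", event_col="event")
print(cph.summary.loc["dosage_x_colchicine", ["exp(coef)", "p"]])  # interaction HR and p-value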