John Vivian scite author profile

Since its 2001 debut, the University of California, Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu/) team has provided continuous support to the international genomics and biomedical communities through a web-based, open source platform designed for the fast, scalable display of sequence alignments and annotations landscaped against a vast collection of quality reference genome assemblies. The browser's publicly accessible databases are the backbone of a rich, integrated bioinformatics tool suite that includes a graphical interface for data queries and downloads, alignment programs, command-line utilities and more. This year's highlights include newly designed home and gateway pages; a new ‘multi-region’ track display configuration for exon-only, gene-only and custom regions visualization; new genome browsers for three species (brown kiwi, crab-eating macaque and Malayan flying lemur); eight updated genome assemblies; extended support for new data types such as CRAM, RNA-seq expression data and long-range chromatin interaction pairs; and the unveiling of a new supported mirror site in Japan.


Error rates for nanopore discrimination among cytosine, methylcytosine, and hydroxymethylcytosine along individual DNA strands

Schreiber

Wescoe

Abu-Shumays

et al. 2013

Proc. Natl. Acad. Sci. U.S.A.



Cytosine, 5-methylcytosine, and 5-hydroxymethylcytosine were identified during translocation of single DNA template strands through a modified Mycobacterium smegmatis porin A (M2MspA) nanopore under control of phi29 DNA polymerase. This identification was based on three consecutive ionic current states that correspond to passage of modified or unmodified CG dinucleotides and their immediate neighbors through the nanopore limiting aperture. To establish quality scores for these calls, we examined ∼3,300 translocation events for 48 distinct DNA constructs. Each experiment analyzed a mixture of cytosine-, 5-methylcytosine-, and 5-hydroxymethylcytosine-bearing DNA strands that contained a marker that independently established the correct cytosine methylation status at the target CG of each molecule tested. To calculate error rates for these calls, we established decision boundaries using a variety of machine-learning methods. These error rates depended upon the identity of the bases immediately 5′ and 3′ of the targeted CG dinucleotide, and ranged from 1.7% to 12.2% for a single-pass read. We estimate that Q40 values (0.01% error rates) for methylation status calls could be achieved by reading single molecules 5-19 times depending upon sequence context.

MspA | epigenetics

Epigenetic modifications of DNA help regulate gene transcription in biological cells. In mammals, 5-methylcytosine (mC) modification of CG dinucleotides is known to influence development (1, 2) and contribute to human diseases including cancer (3). Other modifications have been detected at carbon 5 of cytosine including 5-hydroxymethylcytosine (hmC) (4), and more recently 5-formylcytosine, and 5-carboxycytosine (5). Physiological roles for hmC in carcinogenesis and embryonic stem cell differentiation have been proposed (6).

High-throughput techniques for mC detection are based on bisulfite treatment of genomic DNA (7). In the conventional assay, cytosine (but not mC nor hmC) is converted to uracil (8). Thus, positions not converted to uracil identify cytosines that were modified in the original genomic sequence. In a landmark paper, Lister et al. (9) used this technique to map genome-wide cytosine methylation in human embryonic stem cells and fetal lung fibroblasts at single-nucleotide precision. Recently, bisulfite strategies for discriminating between mC and hmC using the Tet1 enzyme (10) or by chemical modification of hmC (11) have been described.

Single-molecule techniques have emerged as possible alternatives to bisulfite treatment for detecting epigenetic modifications of DNA (12). These single-molecule approaches share several useful features including few processing steps before sequence analysis, long reads that routinely exceed several thousand nucleotides, and the ability to read native DNA strands in heterogeneous mixtures. The most advanced of these single-molecule techniques, from Pacific Biosciences, uses fluorescence to detect labeled nucleotide triphosphates during daughter-strand elongation. This elongation is catalyzed by a DNA pol...
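The abstract above equates Q40 with a 0.01% error rate. That follows from the standard Phred definition, Q = -10 · log10(p). The sketch below only illustrates that conversion; the function names are hypothetical, and it does not reproduce the authors' machine-learning decision boundaries or their estimate of how many reads are needed per molecule.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score Q to an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Convert an error probability p (0 < p <= 1) to a Phred score Q = -10*log10(p)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Q40 corresponds to an error probability of 1e-4, i.e. 0.01%.
    print(f"Q40 error rate: {phred_to_error(40):.4%}")
    # Single-pass error rates quoted in the abstract, expressed as Phred scores.
    for p in (0.017, 0.122):
        print(f"error rate {p:.1%} is about Q{error_to_phred(p):.1f}")
```

On this scale the quoted single-pass error rates of 1.7% and 12.2% sit at roughly Q17.7 and Q9.1, which gives a sense of the gap that repeated reads of the same molecule would need to close to reach Q40.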


Co-expression networks reveal the tissue-specific regulation of transcription and splicing

et al. 2017


Gene co-expression networks capture biologically important patterns in gene expression data, enabling functional analyses of genes, discovery of biomarkers, and interpretation of genetic variants. Most network analyses to date have been limited to assessing correlation between total gene expression levels in a single tissue or small sets of tissues. Here, we built networks that additionally capture the regulation of relative isoform abundance and splicing, along with tissue-specific connections unique to each of a diverse set of tissues. We used the Genotype-Tissue Expression (GTEx) project v6 RNA sequencing data across 50 tissues and 449 individuals. First, we developed a framework called Transcriptome-Wide Networks (TWNs) for combining total expression and relative isoform levels into a single sparse network, capturing the interplay between the regulation of splicing and transcription. We built TWNs for 16 tissues and found that hubs in these networks were strongly enriched for splicing and RNA binding genes, demonstrating their utility in unraveling regulation of splicing in the human transcriptome. Next, we used a Bayesian biclustering model that identifies network edges unique to a single tissue to reconstruct Tissue-Specific Networks (TSNs) for 26 distinct tissues and 10 groups of related tissues. Finally, we found genetic variants associated with pairs of adjacent nodes in our networks, supporting the estimated network structures and identifying 20 genetic variants with distant regulatory impact on transcription and splicing. Our networks provide an improved understanding of the complex relationships of the human transcriptome across tissues.


Comparative Tumor RNA Sequencing Analysis for Difficult-to-Treat Pediatric and Young Adult Patients With Cancer

et al. 2019


Rapid and efficient analysis of 20,000 RNA-seq samples with Toil

Vivian

Nothaft

et al. 2016

Preprint


bioRxiv preprint first posted online Jul. 7, 2016; doi: http://dx.doi.org/10.1101/062497. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.

Toil is portable, open-source workflow software that supports contemporary workflow definition languages and can be used to securely and reproducibly run scientific workflows efficiently at large-scale. To demonstrate Toil, we processed over 20,000 RNA-seq samples to create a consistent meta-analysis of five datasets free of computational batch effects that we make freely available. Nearly all the samples were analysed in under four days using a commercial cloud cluster of 32,000 preemptable cores.

Figure 1. (Left) A dependency graph of the RNA-seq pipeline we developed (called CGL). CutAdapt was used to remove extraneous adapters, STAR was used for alignment and read coverage, and RSEM and Kallisto were used to produce quantification data. (Right) A scatter plot showing the Pearson correlation between the results of the TCGA best-practices pipeline and the CGL pipeline. 10,000 randomly selected sample/gene pairs were subset from the entire TCGA cohort and the normalized counts were plotted against each other; this process was repeated 5 times with no change in Pearson correlation. The unit for counts is: log2(norm counts+1).

Contemporary genomic datasets contain tens of thousands of samples and petabytes of sequencing data 1,2,3. Genomic processing pipelines can consist of dozens of individual steps, each with their own set of parameters 4,5. As a result of this size and complexity, computational resource limitations and reproducibility are becoming a major concern within genomics. In response to these interrelated issues, we have created Toil.

Reproducible Workflows

To support the sharing of scientific workflows, Toil is the first software to execute Common Workflow Language (CWL, Supplementary Note 7) and provide draft support for Workflow Description Language (WDL), both burgeoning standards for scientific workflows 6,7. A workflow is composed of a set of tasks, or jobs, that are orchestrated by specification of a set of dependencies that map the inputs and outputs between jobs. In addition to CWL and draft WDL support, Toil provides a Python API that allows workflows to be declared statically, or generated dynamically, so that jobs can define further jobs as needed (Supplementary Note 1). The jobs defined in either CWL or Python can consist of Docker containers, which permit sharing of a program without requiring individual tool installation or configuration within a specific environment. Open-source workflows that invoke containers can therefore be run precisely and reproducibly, regardless of environment. We provide a repository of workflows as examples 8. Toil also integrates with Apache Spark 9 (Supplementary Note 6, Supplementary Fig. 4), and can be used to rapidly create cont...
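The Figure 1 caption describes comparing two pipelines by the Pearson correlation of counts in units of log2(norm counts+1). A minimal sketch of that transform and correlation is given below, assuming plain Python lists of non-negative normalized counts. The function names and the toy numbers are hypothetical and are not taken from the TCGA or CGL outputs.

```python
import math


def log2_norm(counts):
    """Apply the log2(norm counts + 1) transform to a list of non-negative counts."""
    return [math.log2(c + 1) for c in counts]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical normalized counts for the same sample/gene pairs
    # as reported by two different pipelines.
    pipeline_a = [0, 12, 150, 980, 4200]
    pipeline_b = [1, 10, 160, 1010, 4100]
    r = pearson(log2_norm(pipeline_a), log2_norm(pipeline_b))
    print(f"Pearson r on log2(norm counts + 1): {r:.4f}")
```

The log transform keeps a handful of very highly expressed genes from dominating the correlation, which is presumably why the caption states the unit explicitly.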


A Bayesian Framework for Detecting Gene Expression Outliers in Individual Samples

Vivian

Eizenga

Beale

et al. 2019

Preprint


Objective: Many antineoplastics are designed to target upregulated genes, but quantifying upregulation in a single patient sample requires an appropriate set of samples for comparison. In cancer, the most natural comparison set is unaffected samples from the matching tissue, but there are often too few available unaffected samples to overcome high inter-sample variance. Moreover, some cancer samples have misidentified tissues of origin, or even composite-tissue phenotypes. Even if an appropriate comparison set can be identified, most differential expression tools are not designed to accommodate comparing to a single patient sample. Materials and Methods: We propose a Bayesian statistical framework for gene expression outlier detection in single samples. Our method uses all available data to produce a consensus background distribution for each gene of interest without requiring the researcher to manually select a comparison set. The consensus distribution can then be used to quantify over- and under-expression. Results: We demonstrate this method on both simulated and real gene expression data. We show that it can robustly quantify overexpression, even when the set of comparison samples lacks ideally matched tissue samples. Further, our results show that the method can identify appropriate comparison sets from samples of mixed lineage and rediscover numerous known gene-cancer expression patterns. Conclusions: This exploratory method is suitable for identifying expression outliers from comparative RNA-seq analysis for individual samples, and Treehouse, a pediatric precision medicine group that leverages RNA-seq to identify potential therapeutic leads for patients, plans to explore this method for processing their pediatric cohort.


Bayesian Framework for Detecting Gene Expression Outliers in Individual Samples

Vivian

Eizenga

Beale

et al. 2020

JCO Clinical Cancer Informatics


PURPOSE Many antineoplastics are designed to target upregulated genes, but quantifying upregulation in a single patient sample requires an appropriate set of samples for comparison. In cancer, the most natural comparison set is unaffected samples from the matching tissue, but there are often too few available unaffected samples to overcome high intersample variance. Moreover, some cancer samples have misidentified tissues of origin or even composite-tissue phenotypes. Even if an appropriate comparison set can be identified, most differential expression tools are not designed to accommodate comparisons to a single patient sample. METHODS We propose a Bayesian statistical framework for gene expression outlier detection in single samples. Our method uses all available data to produce a consensus background distribution for each gene of interest without requiring the researcher to manually select a comparison set. The consensus distribution can then be used to quantify over- and underexpression. RESULTS We demonstrate this method on both simulated and real gene expression data. We show that it can robustly quantify overexpression, even when the set of comparison samples lacks ideally matched tissue samples. Furthermore, our results show that the method can identify appropriate comparison sets from samples of mixed lineage and rediscover numerous known gene-cancer expression patterns. CONCLUSION This exploratory method is suitable for identifying expression outliers from comparative RNA sequencing (RNA-seq) analysis for individual samples, and Treehouse, a pediatric precision medicine group that leverages RNA-seq to identify potential therapeutic leads for patients, plans to explore this method for processing its pediatric cohort.
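The abstract says a consensus background distribution is used to quantify over- and underexpression, but it does not spell out the model. As a loose illustration only, and not the authors' Bayesian framework, the sketch below assumes the consensus is a weighted mixture of normal distributions, one per background dataset, and reports the probability that a background draw exceeds the observed expression. All names, weights, and numbers are hypothetical.

```python
import math


def normal_sf(x, mean, sd):
    """Survival function P(X > x) for a normal distribution with given mean and sd."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2)))


def mixture_tail_probability(x, components):
    """P(background > x) under a weighted mixture of normals.

    components: list of (weight, mean, sd) tuples; weights are renormalized
    to sum to 1. A small value suggests x is unusually high (overexpressed)
    relative to the assumed background.
    """
    if not components:
        raise ValueError("need at least one background component")
    total = sum(w for w, _, _ in components)
    if total <= 0 or any(w < 0 or sd <= 0 for w, _, sd in components):
        raise ValueError("weights must be non-negative and sd positive")
    return sum((w / total) * normal_sf(x, m, sd) for w, m, sd in components)


if __name__ == "__main__":
    # Hypothetical background: two tissue datasets with assumed weights,
    # means, and standard deviations on a log2 expression scale.
    background = [(0.7, 4.0, 1.0), (0.3, 6.0, 1.5)]
    observed = 9.5  # hypothetical log2 expression in the patient sample
    p = mixture_tail_probability(observed, background)
    print(f"P(background > observed) = {p:.3g}")
```

In this toy setup the weights stand in for how much each background dataset resembles the sample; in the paper such weighting is learned from the data rather than chosen by hand, which is the point of not requiring a manually selected comparison set.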



John Vivian

Toil enables reproducible, open source, big biomedical data analyses

OUP accepted manuscript
