Public compendia of sequencing data are now measured in petabytes. Accordingly, it is infeasible for researchers to transfer these data to local computers. Recently, the National Cancer Institute began exploring opportunities to work with molecular data in cloud-computing environments. With this approach, it becomes possible for scientists to take their tools to the data and thereby avoid large data transfers. It also becomes feasible to scale computing resources to the needs of a given analysis. We quantified transcript-expression levels for 12,307 RNA-Sequencing samples from the Cancer Cell Line Encyclopedia and The Cancer Genome Atlas. We used two cloud-based configurations and examined the performance and cost profiles of each configuration. Using preemptible virtual machines, we processed the samples for as little as $0.09 (USD) per sample. As the samples were processed, we collected performance metrics, which helped us track the duration of each processing step and quantified computational resources used at different stages of sample processing. Although the computational demands of reference alignment and expression quantification have decreased considerably, there remains a critical need for researchers to optimize preprocessing steps. We have stored the software, scripts, and processed data in a publicly accessible repository (https://osf.io/gqrz9).
Public compendia of raw sequencing data are now measured in petabytes. Accordingly, it is becoming infeasible for individual researchers to transfer these data to local computers. Recently, the National Cancer Institute funded an initiative to explore opportunities and challenges of working with molecular data in cloud-computing environments. With data in the cloud, it becomes possible for scientists to take their tools to the data and thereby avoid large data transfers. It also becomes feasible to scale computing resources to the needs of a given analysis. To evaluate this concept, we quantified transcript-expression levels for 12,307 RNA-Sequencing samples from the Cancer Cell Line Encyclopedia and The Cancer Genome Atlas. We used two cloud-based configurations to process the data and examined the performance and cost profiles of each configuration. Using "preemptible virtual machines", we processed the samples for as little as $0.09 (USD) per sample. In total, we processed the TCGA samples (n=11,373) for only $1,065.49 and processed thousands of samples simultaneously. As the samples were being processed, we collected detailed performance metrics, which helped us to track the duration of each processing step and to identify computational resources used at different stages of sample processing. Although the computational demands of reference alignment and expression quantification have decreased considerably, there remains a critical need for researchers to optimize preprocessing steps (e.g., sorting, converting, and trimming sequencing reads). We have created open-source Docker containers that include all the software and scripts necessary to process such data in the cloud and to collect performance metrics. The processed data are available in tabular format and in Google's BigQuery database (see https://osf.io/gqrz9).
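As a quick check on the cost figures above, the following sketch divides the reported TCGA total by the reported sample count. It assumes, without confirmation from the abstract, that the per-sample figure is comparable to this simple average; the function and constant names are illustrative only.

```python
from decimal import Decimal

# Figures quoted in the abstract above.
TCGA_TOTAL_COST_USD = Decimal("1065.49")
TCGA_SAMPLE_COUNT = 11373


def cost_per_sample(total_cost, sample_count):
    """Return the average cost per sample (hypothetical helper)."""
    return total_cost / sample_count


average = cost_per_sample(TCGA_TOTAL_COST_USD, TCGA_SAMPLE_COUNT)
print(f"Average cost per TCGA sample: ${average:.4f}")
```

The average comes to roughly nine cents per sample, which is in line with the "as little as $0.09" figure, although the abstract does not say how that figure was derived.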
Erwinia amylovora is the causal agent of fire blight, a devastating disease affecting some plants of the Rosaceae family. We isolated bacteriophages from samples collected from infected apple and pear trees along the Wasatch Front in Utah. We announce 19 high-quality complete genome sequences of E. amylovora bacteriophages.
A major challenge in cancer research is to determine the biological and clinical significance of somatic mutations in noncoding regions. This has been studied in terms of recurrence, functional impact, and association with individual regulatory sites, but the combinatorial contribution of mutations to common RNA regulatory motifs has not been explored. Therefore, we developed a new method, MIRA (mutation identification for RNA alterations), to perform an unbiased and comprehensive study of significantly mutated regions (SMR) affecting binding sites for RNA-binding proteins (RBP) in cancer. Extracting signals related to RNA-related selection processes and using RNA sequencing (RNA-seq) data from the same specimens, we identified alterations in RNA expression and splicing linked to mutations on RBP binding sites. We found SRSF10 and MBNL1 motifs in introns, HNRPLL motifs at 5' UTRs, as well as 5' and 3' splice-site motifs, among others, with specific mutational patterns that disrupt the motif and impact RNA processing. MIRA facilitates the integrative analysis of multiple genome sites that operate collectively through common RBPs and aids in the interpretation of noncoding variants in cancer. MIRA is available at https://github.com/comprna/mira. The study of recurrent cancer mutations on potential RBP binding sites reveals new alterations in introns, untranslated regions, and long noncoding RNAs that impact RNA processing and provide a new layer of insight that can aid in the interpretation of noncoding variants in cancer genomes.
Motivation: Biologists commonly store data in tabular form with observations as rows, attributes as columns, and measurements as values. Due to advances in high-throughput technologies, the sizes of tabular datasets are increasing. Some datasets contain millions of rows or columns. To work effectively with such data, researchers must be able to efficiently extract subsets of the data (using filters to select specific rows and retrieving specific columns). However, existing methodologies for querying tabular data do not scale adequately to large datasets or require specialized tools for processing. We sought a methodology that would overcome these challenges and that could be applied to an existing, text-based format.
Results: In a systematic benchmark, we tested 10 techniques for querying simulated, tabular datasets. These techniques included a delimiter-splitting method, the Python pandas module, regular expressions, object serialization, the awk utility, and string-based indexing. We found that storing the data in fixed-width formats provided excellent performance for extracting data subsets. Because columns have the same width on every row, we could pre-calculate column and row coordinates and quickly extract relevant data from the files. Memory mapping led to additional performance gains. A limitation of fixed-width files is the increased storage requirement of buffer characters. Compression algorithms help to mitigate this limitation at a cost of reduced query speeds. Lastly, we used this methodology to transpose tabular files that were hundreds of gigabytes in size, without creating temporary files. We propose coordinate-based, fixed-width storage as a fast, scalable methodology for querying tabular biological data.
Biologists often generate data suitable for representation in an attribute-value system [1], also known as an information system [2], simple frame [3], object-predicate table [4], or flat file. In this representation, an object might be a biological organism, an attribute might be a characteristic of that organism, and a value might be a datum for that object and attribute. For example, a researcher might observe 200 cancer patients (objects) and collect transcriptomic measurements for 20,000 genes (attributes); each value would indicate the relative number of transcripts present in tumor cells for each patient/gene combination [5]. In this example, the data values would have been summarized previously using preprocessing tools, such as a reference aligner and a transcript-quantification algorithm [6-9]. For convenience and compactness, researchers typically store attribute-value data in 2-dimensional, tabular formats. Commonly, in such tables, each row contains data for a given object, and each column contains data for a given attribute [10]; but in some cases, the table is transposed (objects as columns, attributes as rows). Researchers use tabular data to perform analytical tasks, such as executing statistic...
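The coordinate-based, fixed-width idea described in the Results above can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the toy table, and the choice to pad with trailing spaces are assumptions made for illustration. Because every column has the same width on every line, the byte position of any cell can be computed directly, and a memory-mapped file lets us slice out only those bytes.

```python
import mmap
import os
import tempfile


def write_fixed_width(path, rows):
    """Write rows (lists of strings) so each column has one width on every line.

    Returns the list of column widths (hypothetical helper).
    """
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    with open(path, "wb") as handle:
        for row in rows:
            line = "".join(value.ljust(width) for value, width in zip(row, widths))
            handle.write(line.encode("ascii") + b"\n")
    return widths


def column_starts(widths):
    """Pre-calculate the starting byte offset of each column within a line."""
    starts, position = [], 0
    for width in widths:
        starts.append(position)
        position += width
    return starts


def query(path, widths, row_indices, col_indices):
    """Extract the requested cells using pre-calculated coordinates and mmap."""
    starts = column_starts(widths)
    line_length = sum(widths) + 1  # +1 for the newline byte
    results = []
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            for r in row_indices:
                row_offset = r * line_length
                values = []
                for c in col_indices:
                    begin = row_offset + starts[c]
                    cell = mapped[begin:begin + widths[c]]
                    # Strip the padding ("buffer") characters added when writing.
                    values.append(cell.decode("ascii").rstrip())
                results.append(values)
    return results


def demo():
    """Build a toy table and pull two rows and two columns from it."""
    rows = [
        ["Sample", "Gene1", "Gene2", "Gene3"],
        ["P1", "5.2", "10.75", "0.1"],
        ["P2", "12.0", "3.3", "7.45"],
        ["P3", "0.9", "8.8", "2.25"],
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.fwf")
        widths = write_fixed_width(path, rows)
        return query(path, widths, [1, 3], [0, 2])


print(demo())
```

In this sketch no line is ever split on a delimiter or scanned from its start, so the work per query depends on the number of requested cells, not on the width of the table. The padding bytes are the storage overhead that the abstract notes as a limitation.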