The Cancer Genome Collaboratory is an academic compute cloud designed to enable computational research on the world’s largest and most comprehensive cancer genome dataset, that of the International Cancer Genome Consortium (ICGC). The ICGC is on target to categorize the genomes of 25,000 tumors by 2018. A subproject of ICGC, the PanCancer Analysis of Whole Genomes (PCAWG), has alone generated over 800TB of harmonized sequence alignments, variants and interpreted data from over 2,800 cancer patients. A dataset of this size requires months to download and significant resources to store and process. Because the Collaboratory makes the ICGC data available inside a compute cloud, researchers can bring their analysis methods to the cloud, benefiting from the high availability, scalability and economy offered by cloud services, avoiding a large investment in static compute resources and essentially eliminating the time needed to download the data. To facilitate computational analysis of the ICGC data, the Collaboratory has developed software solutions that are optimized for typical cancer genomics workloads, including well-tested and accurate genome aligners and somatic variant calling pipelines. We have developed a data transfer tool that is simple to use yet fast and secure, and that imports genomic data from cloud object storage into the user’s compute instances. Because a growing number of cancer datasets have restrictions on their storage locations, it is important to have software solutions that are interoperable across multiple cloud environments. We have successfully demonstrated interoperability across The Cancer Genome Atlas (TCGA) dataset hosted at the University of Chicago’s Bionimbus Protected Data Cloud, the ICGC dataset hosted at the Collaboratory, and ICGC datasets stored in Amazon Web Services (AWS) S3 storage.
Lastly, we have developed a non-intrusive user authorization system that allows the Collaboratory to authenticate against the ICGC Data Access Compliance Office (DACO) when researchers require access to controlled tier data. We anticipate that our software solutions will be implemented on additional commercial and academic clouds. The Collaboratory is actively growing, with a target hardware infrastructure of over 3,000 CPU cores and 15 petabytes of raw storage. As of November 2016, the Collaboratory holds information on 2,000 ICGC PCAWG donors (500TB total). We anticipate expanding the Collaboratory to host the entire ICGC dataset of 25,000 donors (approximately 5PB) and to extend its data management and analysis facilities across multiple clouds. During the current closed beta phase, the Collaboratory has been successfully utilized by multiple research groups, most notably PCAWG project researchers who analyzed thousands of genomes at scale over a few weeks’ time. The Collaboratory will open to the public during the second quarter of 2017. We invite cancer researchers to learn more about our cloud resources at cancercollaboratory.org, and apply for access to the Collaboratory.
Citation Format: Christina K. Yung, George L. Mihaiescu, Bob Tiernay, Junjun Zhang, Francois Gerthoffert, Andy Yang, Jared Baker, Guillaume Bourque, Paul C. Boutros, Bartha M. Knoppers, BF Francis Ouellette, Cenk Sahinalp, Sohrab P. Shah, Vincent Ferretti, Lincoln D. Stein. The Cancer Genome Collaboratory [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2017; 2017 Apr 1-5; Washington, DC. Philadelphia (PA): AACR; Cancer Res 2017;77(13 Suppl):Abstract nr 378. doi:10.1158/1538-7445.AM2017-378
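As a rough check on the statement in the Collaboratory abstract that a dataset of this size takes months to download, the arithmetic can be sketched in a few lines of Python. The link speeds below are illustrative assumptions rather than figures from the abstract, the function name `download_days` is hypothetical, and 1TB is taken as 10^12 bytes.

```python
def download_days(size_tb, link_gbps, efficiency=1.0):
    """Estimate the days needed to transfer size_tb terabytes.

    Assumptions (not from the abstract): decimal units (1 TB = 1e12 bytes,
    1 Gbit/s = 1e9 bits/s) and a link sustained at `efficiency` of its
    nominal rate for the whole transfer.
    """
    if size_tb < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, link > 0, efficiency in (0, 1]")
    bits = size_tb * 1e12 * 8                      # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400                        # seconds -> days


# PCAWG size quoted in the abstract, with assumed sustained link speeds.
for gbps in (0.1, 1.0, 10.0):
    print(f"{gbps:>5} Gbit/s: {download_days(800, gbps):7.1f} days")
```

Under these assumptions, 800TB over a fully utilized 1 Gbit/s link takes roughly 74 days, which is consistent with the abstract's "months" for typical institutional connections.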
The goal of the International Cancer Genome Consortium (ICGC) is to analyze the cancer genomes of at least 500 tumour samples with matched controls from 50 different cancer types and subtypes, building a comprehensive catalogue of somatic abnormalities for the benefit of the research community. The amount of data ICGC members will generate is close to that of 50,000 human genome projects and, to date, the consortium has received commitments for 107 projects to study more than 27,000 tumor genomes. The ICGC Data Coordination Center (DCC) is responsible for collecting, curating, aggregating, and disseminating the data generated by the consortium’s member projects. Given the size and the complexity of the ICGC data, these tasks represent significant scientific and technological challenges that require a performant, robust software infrastructure. Key to this infrastructure is the ability to scale as data grows. Using state-of-the-art Big Data, bioinformatics and cloud computing technologies, we developed a suite of web-based applications and microservices that enable member projects to submit their data and validate their submissions against the rules defined in the submission specification. Following validation, the data is processed, annotated and loaded into the data portal using a modular Extract-Transform-Load (ETL) pipeline. Submission, ETL and portal systems are built using scalable and distributed technologies such as Hadoop, Spark, MongoDB and Elasticsearch. Spark is used to validate, join, index, and harmonize annotations on submitted variants, while Elasticsearch powers our variant query engine, API and portal displays. Here we present the ICGC Data Portal and describe both the current features and capabilities accessible to users and the architecture of the underlying infrastructure. The portal provides scientists with powerful and unique tools for exploring and visualizing the millions of variants and annotations available.
These include sophisticated faceted search capabilities that make data exploration extremely fast and easy, a suite of interactive JavaScript components for in-depth analysis and visualization of specific genomic features, embedded genome and pathway browsers, synthetic cohort comparisons and a streaming data download service. The portal integrates a large variety of annotations such as variant consequences and frequencies, functional impact factors and druggability. The portal also offers cloud-based tools for searching a catalog of raw ICGC data files stored in worldwide repositories and compute clouds. All source code is open to the community under the GPLv3 license.
Citation Format: Junjun Zhang, Bob Tiernay, Dusan Andric, Phuong-My Do, Sid Joshi, Vitalii Slobodianyk, Chang Wang, Shane Wilson, Andy Yang, Vincent Ferretti. The ICGC data portal and its underlying open source software architecture [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2017; 2017 Apr 1-5; Washington, DC. Philadelphia (PA): AACR; Cancer Res 2017;77(13 Suppl):Abstract nr 2602. doi:10.1158/1538-7445.AM2017-2602
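The validate, annotate, load and faceted-search flow described in the portal abstract can be illustrated with a minimal Python sketch. This is not the DCC's code: plain Python stands in for Spark and Elasticsearch, and the record schema, the validation rules, the function names and the sample records are all hypothetical placeholders chosen for illustration.

```python
from collections import Counter

# Hypothetical submission schema; the real ICGC specification is not shown here.
REQUIRED_FIELDS = ("donor_id", "chromosome", "position", "ref", "alt")
VALID_BASES = set("ACGT")


def validate(record):
    """Return a list of rule violations; an empty list means the record passes."""
    errors = [f"missing {name}" for name in REQUIRED_FIELDS if name not in record]
    if errors:
        return errors
    if not isinstance(record["position"], int) or record["position"] < 1:
        errors.append("position must be a positive integer")
    for name in ("ref", "alt"):
        allele = record[name]
        if not isinstance(allele, str) or not allele or not set(allele) <= VALID_BASES:
            errors.append(f"{name} must be a non-empty string over A, C, G, T")
    return errors


def annotate(record):
    """Transform step: derive a coarse variant type from the allele lengths."""
    ref, alt = record["ref"], record["alt"]
    if len(ref) == len(alt):
        kind = "substitution"
    elif len(ref) < len(alt):
        kind = "insertion"
    else:
        kind = "deletion"
    return {**record, "variant_type": kind}


def run_etl(submission):
    """Validate each record, annotate the valid ones, and load them into a list
    that stands in for the search index. Rejected records keep their errors."""
    index, rejected = [], []
    for record in submission:
        errors = validate(record)
        if errors:
            rejected.append((record, errors))
        else:
            index.append(annotate(record))
    return index, rejected


def facet_counts(index, facet, **filters):
    """Count values of `facet` among documents matching all keyword filters."""
    hits = [doc for doc in index
            if all(doc.get(key) == value for key, value in filters.items())]
    return Counter(doc[facet] for doc in hits)


# Made-up records for illustration only; the last one violates two rules.
submission = [
    {"donor_id": "D1", "chromosome": "7", "position": 100, "ref": "A", "alt": "T"},
    {"donor_id": "D1", "chromosome": "17", "position": 200, "ref": "C", "alt": "T"},
    {"donor_id": "D2", "chromosome": "17", "position": 300, "ref": "CG", "alt": "C"},
    {"donor_id": "D3", "chromosome": "12", "position": 0, "ref": "G", "alt": "N"},
]
index, rejected = run_etl(submission)
print(facet_counts(index, "chromosome"))
print(facet_counts(index, "variant_type", chromosome="17"))
```

Keeping validation, annotation and loading as separate functions mirrors the modular pipeline the abstract describes, and each stage could be swapped for a distributed implementation without changing the others. The facet counts are the kind of per-value tallies a faceted search interface displays next to each filter.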