The Cancer Genomics Cloud

The ISB Cancer Genomics Cloud: Leveraging Google Cloud Platform for TCGA Analysis

Sheila Reynolds, Michael Miller, Phyliss Lee, Kelly Iverson, Kalle Leinonen, Zack Rodebaugh, Lesley Wilkerson, Preston Holmes, Nicole Deflaux, Dennis Ai, John Lucena, Simon Fung, Sandeep Namburi, Yan Zhang, Walter Dula, David Pot, Jonathan Bingham, Ilya Shmulevich

The ISB Cancer Genomics Cloud (ISB-CGC) is one of three pilot projects funded by the National Cancer Institute with the goal of democratizing access to The Cancer Genome Atlas (TCGA) data by substantially lowering the barriers to accessing and computing over this rich dataset. The ISB-CGC is a cloud-based platform that will serve as a large-scale data repository for TCGA data, while also providing the computational infrastructure and interactive exploratory tools necessary to carry out cancer genomics research at unprecedented scales. The ISB-CGC will also facilitate collaborative research by allowing scientists to share data, analyses, and insights in a cloud environment.

The ISB-CGC will provide interactive and programmatic access to the TCGA data, leveraging many components of Google Cloud Platform, including BigQuery, Compute Engine, and App Engine. Open-access clinical and biospecimen information for all TCGA patients and samples, combined with the Level-3 TCGA data and a variety of genomic reference and platform-annotation sources, will be stored in BigQuery, enabling fast SQL-like queries against the entire dataset. Controlled-access DNA and RNA sequence data will be available to dbGaP-authorized users both in the original BAM and FASTQ file formats and through the Global Alliance for Genomics and Health (GA4GH) API.

The ISB-CGC aims to serve the needs of a broad range of cancer researchers: scientists or clinicians who prefer an interactive web-based application for accessing and exploring the rich TCGA dataset; computational scientists who want to write their own custom scripts in languages such as R or Python, accessing the data through APIs; and algorithm developers who want to spin up thousands of virtual machines to analyze hundreds of terabytes of sequence data. The ISB-CGC will allow scientists to interactively define and compare cohorts, examine the underlying molecular data for specific genes or pathways of interest, and share insights with collaborators around the globe. All registered ISB-CGC users will automatically qualify for Google Cloud Platform credits that can be used to upload their own datasets into Google Cloud Storage and to perform analyses using existing or custom pipelines.

This project has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN261201400007C.