TCGA: The ISB/MD Anderson Genome Data Analysis Center

 

The Institute for Systems Biology/MD Anderson Cancer Center Genome Data Analysis Center – TCGA

The Cancer Genome Atlas (TCGA) provides an unprecedented opportunity to take an integrated approach toward a systems-level understanding of regulatory disruptions in cancer. These disruptions and their consequences are intertwined within complex dynamical networks through many interactions among different types of molecules. Understanding these relationships requires multivariate analysis methods that remain effective with highly heterogeneous data, measurement uncertainty, and missing data.

The Cancer Genome Atlas targets more than 30 different cancer types, collecting hundreds of samples for each type. Each cancer type is studied individually by multiple groups across TCGA. The Shmulevich Lab at the Institute for Systems Biology (ISB) and the Wei Zhang Lab at MD Anderson Cancer Center (MDACC) collaborate to form a TCGA Genome Data Analysis Center (GDAC). The ISB/MDACC GDAC contributes to TCGA in three primary capacities. First, the Center develops novel computational approaches for analyzing large-scale heterogeneous data. Second, the Center participates in and contributes to numerous working groups focused on individual tumor types. Third, the Center disseminates results and resources to the broader community and promotes collaboration within TCGA by developing web-based interactive exploratory tools (such as Regulome Explorer), providing programmatic access to its methods, and enhancing computational infrastructure that supports collaborative research.

We develop methods for finding associations among the heterogeneous data types in TCGA data. This work includes constructing a feature matrix: a large, heterogeneous matrix that combines virtually all available information about the patients and samples for a given tumor type. The feature matrix is created by parsing and standardizing both public and protected TCGA data available through the TCGA Data Coordinating Center (DCC): clinical data, mRNA (gene) expression, DNA methylation, microRNA expression, copy number variation, somatic (DNA) mutation data, and RPPA (protein) data.

Our Center also incorporates other sources of information from the TCGA Genome Characterization Centers (GCCs) and from other GDACs. The resulting mixed-type feature matrix holds numerical data (both continuous and discrete) and arbitrary unordered categorical data, and it allows for missing values.
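As a rough illustration of what a mixed-type feature matrix with missing values might look like, the sketch below aligns features from several data types onto one shared sample order. It is not the Center's actual pipeline or file format: the `Feature` class, the `build_feature_matrix` and `missing_fraction` helpers, the sample IDs, and feature names such as `GENE_A` are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

Value = Optional[Union[float, str]]  # None marks a missing measurement


@dataclass
class Feature:
    name: str            # illustrative label, not a real TCGA identifier
    kind: str            # "numeric" or "categorical"
    values: List[Value]  # one entry per sample, aligned to the sample list


def build_feature_matrix(samples, sources):
    """Align every feature from every source onto one shared sample order.

    `sources` maps a data-type name to {feature_name: {sample_id: value}}.
    A sample that a source did not measure is filled with None (missing).
    """
    matrix = []
    for source_name, features in sources.items():
        for feature_name, per_sample in features.items():
            values = [per_sample.get(s) for s in samples]
            observed = [v for v in values if v is not None]
            # Any string value makes the feature categorical; otherwise numeric.
            is_categorical = any(isinstance(v, str) for v in observed)
            kind = "categorical" if is_categorical else "numeric"
            matrix.append(Feature(f"{source_name}:{feature_name}", kind, values))
    return matrix


def missing_fraction(feature):
    """Fraction of samples with no value for this feature."""
    return sum(v is None for v in feature.values) / len(feature.values)


# Hypothetical toy inputs: four samples, three data types.
samples = ["S1", "S2", "S3", "S4"]
sources = {
    "clinical": {"stage": {"S1": "II", "S2": "III", "S4": "II"}},
    "expression": {"GENE_A": {"S1": 2.5, "S2": 0.7, "S3": 1.9, "S4": 3.1}},
    "mutation": {"GENE_B": {"S1": 1, "S2": 0, "S3": 0}},
}
matrix = build_feature_matrix(samples, sources)

for feature in matrix:
    print(feature.name, feature.kind, feature.values, missing_fraction(feature))
```

Keeping one row per feature and one column per sample, with an explicit marker for missing entries, lets downstream code treat every data type uniformly and decide per analysis how to handle the gaps.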

Typical matrices include 20,000 to 50,000 features describing 200 to 1,000 tumor samples. They provide a starting point for all of our downstream analyses, as well as a simple, standardized format for sharing data between collaborators. From the feature matrix, we derive statistically significant pairwise associations and explore them for signals relevant to the development and progression of cancer. These analyses are performed systematically for every tumor analysis working group in which the Center participates.
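The text does not say which statistical tests the Center uses, so the following is only a minimal sketch of one way to score pairwise associations between numeric features. It assumes Spearman rank correlation with a permutation p-value, and drops samples where either feature is missing. All function names, feature names, and values are hypothetical.

```python
import itertools
import math
import random


def average_ranks(xs):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant feature carries no rank information
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    return pearson(average_ranks(xs), average_ranks(ys))


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided p-value: how often shuffled data matches or beats |rho|."""
    rng = random.Random(seed)
    observed = abs(spearman(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def pairwise_associations(features, min_samples=4):
    """Score every feature pair on the samples where both are observed."""
    results = []
    for (name_a, a), (name_b, b) in itertools.combinations(features.items(), 2):
        pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if len(pairs) < min_samples:
            continue
        xs, ys = zip(*pairs)
        results.append((name_a, name_b, spearman(xs, ys), permutation_p_value(xs, ys)))
    return sorted(results, key=lambda r: r[3])  # smallest p-value first


# Hypothetical toy features over eight samples; None marks a missing value.
features = {
    "expr:GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "meth:GENE_A": [0.9, 0.8, None, 0.6, 0.5, 0.4, 0.3, 0.1],
    "cnv:GENE_C": [0.2, -0.1, 0.4, 0.0, 0.3, -0.2, 0.1, None],
}
results = pairwise_associations(features)

for name_a, name_b, rho, p in results:
    print(f"{name_a} vs {name_b}: rho={rho:+.3f}, p={p:.4f}")
```

At the scale described above (tens of thousands of features), an exhaustive all-pairs loop like this one would be far too slow, and the resulting p-values would need correction for multiple testing. Categorical features would also require a different test. The sketch ignores all three concerns.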

To learn more about our Center, please visit our GDAC website: www.cancerregulome.org

This project is supported by Award Number U24CA143835 from the National Cancer Institute.