BROAD INSTITUTE, INC., THE
Cancer is a disease of the genome. The goal of TCGA is to comprehensively characterize the cancer genome to enable 1) the elucidation of the biological bases of cancer, 2) identification of the most promising therapeutic targets, and 3) discovery of biomarkers for personalized cancer medicine. Realizing this potential of cancer genomics requires overcoming three major types of challenges. (i) Technical challenge: nearly all cancer genomic data are noisy and complex, requiring an extensive set of computational procedures to render them suitable for meaningful analyses. (ii) Biological challenge: cancer is enormously genetically complex, with any given tumor harboring a mixture of cancer-causing genomic aberrations (“drivers”) and innocent bystander mutations (“passengers”) that have no oncogenic potential. Developing biological insights from cancer genome data thus requires not only an analytical framework that distinguishes drivers from passengers but also downstream experimental validations, which in turn will enhance and refine the analysis. This is an iterative process that begins with access to and use of TCGA data by the broader cancer community. This leads us to (iii) community challenge: for the TCGA effort to be truly transformative, its output must be made available in a useful form to the worldwide research community, not just to the investigators generating the data or those with specialized genomic and computational expertise. Disparate types of genomic data and biological knowledge must be integrated in a “common language” that supports cancer biologists, genome scientists, computational scientists and clinical investigators (each of whom has different analytical skills and goals) in their efforts to understand and conquer cancers.
To address these challenges, the Broad-Harvard Genome Data Analysis Center has proposed the development of a high-throughput analysis pipeline that leverages the Broad’s established IT infrastructure as well as its production-pipeline expertise and experience. This pipeline will integrate all data from the TCGA network and rapidly generate a pre-defined set of integrative analyses, summaries and graphical illustrations in a format that cancer researchers of diverse backgrounds can understand and exploit, similar to the Results section of a publication but without the delay of a lengthy peer-review process.
In our original proposal, this pipeline was envisioned to be built and run at the Broad within Firehose, an already-operational data analysis pipeline. Our initial aims focused on (1) making Firehose compatible with TCGA; (2) adding functionality (analysis modules) as required for TCGA data; and (3) running the pipeline in Production Mode. With the launch of Phase II of TCGA, it has become clear that the scope of Phase II will be significantly broader (i.e., 20 tumor types characterized in 5 years) and that the transition to NGS platforms will occur on a much more rapid timeline, both of which add demands on our originally proposed work plan. Most significantly, it is clear that all of the TCGA-funded GDACs wish to jointly contribute to building and maintaining this TCGA Analysis Pipeline. The participation of, and integration with, other GDACs will certainly enhance the functionality, versatility and depth of this pipeline greatly, but at the same time it presents significant new barriers that we need to overcome.
This one-year ARRA supplement will provide support to hire new staff and/or retain existing staff to execute the following specific aims, which directly address the new barriers brought about by the expanded scope of TCGA Phase II and the inclusion of other GDACs in the development of a TCGA Analysis Pipeline. The end goal is to accelerate the launch of a fully integrated TCGA Analysis Pipeline.