ICGC-TCGA DREAM Mutation Calling challenge

Created By Kyle Ellrott kellrott
The ICGC-TCGA DREAM Genomic Mutation Calling Challenge (herein, the Challenge) is an international effort to improve standard methods for identifying cancer-associated mutations and rearrangements in whole-genome sequencing (WGS) data. Leaders of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) cancer genomics projects are joining with Sage Bionetworks and IBM-DREAM to initiate this open, crowd-sourced Challenge [1-3]. The goal of this somatic mutation calling (SMC) Challenge is to identify the most accurate mutation detection algorithms and establish the state of the art. Algorithms in this Challenge must take WGS data from tumour and normal samples as input and output mutation calls associated with cancer.

This Challenge is now closed. Only participants who were part of the Challenge can access the data. Should you have any questions related to the Challenge, please visit our ICGC-TCGA-DREAM SMC Challenge Community Forum. You may review the DREAM8.5 Challenge rules here: syn2295117

Background

Cancer is a disease of the genome [4], caused by disruptions in a person's DNA that alter specific gene functions in a population of cells, and specifically their growth. As the population of cancer cells grows, it is believed that the genetic content of the population is further altered by DNA breakages. A metastatic cancer, one that has spread to other parts of the body away from its origin, has evolved from a single cell carrying a specific DNA mutation or set of mutations. Our understanding of the origin and progression of cancer and its mechanisms is still at an early stage. The advancement of cancer research depends mainly on our ability to read the DNA of cancer cells [5-7]. As genome sequencing technologies have evolved, next-generation sequencing (NGS) instruments can now determine millions of DNA sequence fragments, or reads, which collectively span billions of single-letter genome locations.
Today, DNA sequencers can produce terabytes of data in just a few hours. The crux of the problem has thus shifted from the biologist to the computer scientist, yet the picture that explains cancer genomes remains elusive in many ways. Shattering of chromosomes has recently been associated with cancer [8], and complex chromosomal translocations [15] are being characterized in cancer research labs around the world in large cohorts of over 300 patients. Nevertheless, precisely localizing a genomic breakage and resolving its association with cancer remains a challenge.

In summary, the study of the genomic alterations that drive cancer has been accelerated at an unprecedented rate by the advent of next-generation sequencing and related projects around the world. A genomics revolution now aims to systematically characterize every somatic variation in every tumor by sampling large cohorts. While somatic variations can be focused point mutations that create single nucleotide variations (SNVs), they can also be mid-scale copy-number alterations (CNVs) and large-scale intra- or inter-chromosomal rearrangements, i.e., structural variations (SVs) [5-7]. See our list of references below to learn more about this challenging and interesting field.

Motivation

By relating particular genomic variations in a patient's tumor to targetable genes, new drugs and treatments tailored to each patient will be developed; this is the essence and purpose of personalized medicine. However, accurately identifying these variants and rearrangements from NGS data remains an open problem: recent studies indicate that the predictions of existing approaches overlap by only about 20%. As the solution to the cancer genome now hides behind the analysis of terabytes of sequencing data, there is an urgent need for reliable data mining and classification methods that can bring NGS into routine clinical practice.
Therefore, we believe this Challenge is an excellent setting for bringing researchers around the world together to focus on this particular cancer research problem. In fact, we anticipate that the winning Challenge algorithms will become the standard off-the-shelf predictive approaches for the analysis of tens of thousands of cancer genomes sequenced over the next 5 years across many hospitals and bioinformatics labs worldwide.

Data

The ICGC-TCGA DREAM Sequence Analysis Challenge will use real data: 10 tumor/normal matched genomes from prostate and pancreatic cancers, 5 from each cancer type. In addition, 5 simulated-sequencing tumors of increasing complexity will be released to provide easy "training" datasets and to help bring in participants from outside the field of cancer genomics. Each sample will reflect a treatment-naive primary tumor sequenced to ~50x coverage and a paired germ-line sample sequenced to ~30x coverage. See the Data Description page for a thorough description of the data used in this Challenge.

Data distribution through ICGC is governed by a set of procedures and principles designed to meet legal, ethical and regulatory standards for the sharing of human data. Access to raw data will be granted to ICGC DACO-approved participants only. All primary and validation data will be publicly available, even after the contest is completed, creating a gold-standard community resource. Results will be shared without restriction, but raw data will remain under the restrictions and regulations of the ICGC-DACO.

Challenges

The ICGC-TCGA DREAM Sequence Analysis Challenge will be open to all and will aggregate predictive models and source code as a community resource. We believe that the best approach to developing robust and accurate mutation predictions is to enable an open, diverse community where data access is simple and people are incentivized to share.
The main advantage of such open challenges lies in encouraging a diversity of analytical approaches from skilled analysts across scientific disciplines, to solve inherently difficult but important questions together.

Intel-10 SNV Sub-Challenge

Single Nucleotide Variants (SNVs) are alterations of a single base within the DNA code, and often cause sensitivity to specific drugs. A typical cancer may contain tens of thousands of SNVs. SNV detection is more reproducible than SV detection, showing ~50-80% overlap in a set of published studies.

ITM1-10 SV Sub-Challenge

Structural Variations (SVs) are duplications, deletions and rearrangements of medium-size to large segments (>100 bp) of the genome. These variations can include one or several breakpoints, at which an adjacency or junction is defined, explaining the breakages that a normal genome would have to experience to become a cancer genome. Such genomic rearrangements are often described as being the primary cause of cancer. Over the past few decades, clinical cytogeneticists have been able to link specific chromosome breakpoints to clinically defined cancers, including subtypes of leukemias, lymphomas, and sarcomas. Breakpoint detection in cancer genomes is, anecdotally, exceedingly hard. Our (unpublished) pilot study shows ~30% overlap in predictions across multiple calling methods.

Challenge Structure

Assessment

A simultaneous comparison to state-of-the-art simulation approaches will take place. The Challenge will run an unbiased validation: predictions will be experimentally tested after all Challenge entries have been submitted. Validation will be performed by the Boutros Lab at the Ontario Institute for Cancer Research (OICR), which is not entering predictive models into the Challenge. After the Challenge closes, at least 5,000 candidate somatic DNA mutations will be selected for validation by the Challenge organizers. Selection will be done using a public algorithm.
Validation will use an independent technology that sequences the mutation and ~75 bp on either side of it. All mutations will be validated in all samples, allowing assessment of both false-negative and false-positive rates. The performance of the predictive algorithms from the participating Challenge teams will be ranked using the validation data: ranking will be based on sensitivity, specificity and balanced accuracy. A description of how participant algorithms and techniques will be compared can be found on our Algorithm's Performance page.

Incentives

The goal of this Challenge is to identify the best mutation calling techniques. The main incentives are the following:

Publications will be coordinated in collaboration with Nature Publishing Group.

$7,500 in prize money has been contributed by two companies: $5,000 from Intel and $2,500 from Inova Translational Medicine Institute. The top-performing teams competing on the real tumor sub-challenge will split the winnings.

Intel will contribute software engineering resources to develop an optimized, parallelized, professional implementation of the winning algorithm of the SNV sub-challenge, provided as an open source software package to the community. This optimization will enable the winning method to be applied retrospectively to all existing TCGA and ICGC data, and we expect it will facilitate the method's adoption as a widely used community standard for genomic analysis.

Those methods identified as the best will be deployed for use in the ICGC/TCGA WG Pan-Cancer project that will commence next year. ICGC and TCGA have both recently announced that they will jointly analyze over 2,000 whole genome (WG) datasets as part of the next Pan-Cancer effort, with the aim of comprehensively elucidating the genomic changes present in many forms of cancer. Thus, algorithms selected by this DREAM competition will be positioned to help address the needs of the WG Pan-Cancer effort in the coming year.
This will provide the largest unified view of cancer genome variation to date.

How to Participate in The Challenge

Where Do I Start?

Submit your application to enter the Challenge and gain access to the data here. You may review the Terms of Participation here: syn2295117. You may now register here!
Download the Data
Run your algorithms and get a list of genomic variant calls to submit.

My Analysis is Finished, Now What?

Parse your output to meet certain criteria (VCF v4.1)
Submit your results (page opening TBA)

Credits

Paul C. Boutros, Ontario Institute for Cancer Research
Lincoln D. Stein, Ontario Institute for Cancer Research
Josh Stuart, University of California, Santa Cruz
Gustavo Stolovitzky, IBM, DREAM
Stephen Friend, Sage Bionetworks
Adam Margolin, Sage Bionetworks
Thea Norman, Sage Bionetworks

References

1. International Cancer Genome Consortium et al. International network of cancer genome projects. Nature 464, 993–998 (2010). http://icgc.org/
2. The Cancer Genome Atlas (TCGA). http://cancergenome.nih.gov/
3. Dialogue for Reverse Engineering Assessments and Methods (DREAM). http://www.the-dream-project.org/
4. Stratton, M. R., Campbell, P. J. & Futreal, P. A. The cancer genome. Nature 458, 719–724 (2009).
5. Meyerson, M., Gabriel, S. & Getz, G. Advances in understanding cancer genomes through second-generation sequencing. Nature Rev. Genet. 11, 685–696 (2010).
6. Alkan, C. et al. Genome structural variation discovery and genotyping. Nature Rev. Genet. (2011).
7. Medvedev, P. et al. Computational methods for discovering variation with next-generation sequencing. Nature Methods (2009).
8. Stephens, P. J. et al. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell 144, 27–40 (2011).
9. Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337 (2012).
10. Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).
11. Cancer Genome Atlas Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455, 1061–1068 (2008).
12. Cancer Genome Atlas Network. Integrated genomic analyses of ovarian carcinoma. Nature 474, 609–615 (2011).
13. Cancer Genome Atlas Network. Comprehensive genomic characterization of squamous cell lung cancers. Nature 489, 519–525 (2012).
14. Cancer Genome Atlas Network. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N. Engl. J. Med. 368, 2059–2074 (2013).
15. Baca, S. C. et al. Punctuated evolution of prostate cancer genomes. Cell 153, 666–677 (2013).
16. Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: the next generation. Cell 144, 646–674 (2011).
17. Lawrence, M. S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214–218 (2013).
18. Ley, T. J. et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 456, 66–72 (2008).
19. Korbel, J. O. et al. Paired-end mapping reveals extensive structural variation in human genome. Science, 420–426 (2007).
20. Tuzun, E. et al. Fine-scale structural variation of the human genome. Nat. Genet. 37, 727–732 (2005).

About DREAM Challenges

Sage Bionetworks and DREAM are convinced that running open computational Challenges focused on important unsolved questions in systems biomedicine can help advance basic and translational science. By presenting the research community with well-formulated questions that usually involve complex data, we effectively enable the sharing and improvement of predictive models, accelerating many-fold the analysis of such data. The ultimate goal, beyond the competitive aspect of these Challenges, is to foster collaborations of like-minded researchers who together will find solutions to the vexing problems that matter most to citizens and patients.
About Synapse

Synapse is an open computational platform designed to facilitate new ways for data scientists to work with data and with each other. Synapse reinforces the power of DREAM Challenges to catalyze a diverse community of researchers to nucleate around a particular scientific question. Synapse's engaging features such as real-time leaderboards, code-sharing, and provenance tracking incentivize continuous participation in DREAM Challenges. Participants can accelerate scientific progress by generating, sharing, and evolving thousands of predictive models in real time that would have otherwise taken years to produce.
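
Illustration: ranking metrics

The Assessment section states that entries are ranked on sensitivity, specificity and balanced accuracy against the validation data. The sketch below shows one way those three quantities could be computed for a single entry. It is not the Challenge's scoring code: the function names, the `ConfusionCounts` type, and the representation of a call as a `(chromosome, position, ref, alt)` tuple are all assumptions made for illustration, and it assumes every validated site was experimentally confirmed as either a true somatic mutation or not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of one entry's calls against the validated sites (hypothetical type)."""
    tp: int  # called, and validated as somatic
    fp: int  # called, but validated as not somatic
    tn: int  # not called, and validated as not somatic
    fn: int  # not called, but validated as somatic


def count_outcomes(predicted, validated_true, validated_false):
    """Compare a set of predicted sites with the two sets of validated sites.

    Each site is assumed to be a hashable key such as (chrom, pos, ref, alt).
    Predicted sites that were never validated are ignored.
    """
    return ConfusionCounts(
        tp=len(predicted & validated_true),
        fp=len(predicted & validated_false),
        tn=len(validated_false - predicted),
        fn=len(validated_true - predicted),
    )


def sensitivity(c):
    """Fraction of validated somatic mutations that the entry called."""
    total = c.tp + c.fn
    return c.tp / total if total else 0.0


def specificity(c):
    """Fraction of validated non-mutations that the entry correctly left out."""
    total = c.tn + c.fp
    return c.tn / total if total else 0.0


def balanced_accuracy(c):
    """Mean of sensitivity and specificity."""
    return (sensitivity(c) + specificity(c)) / 2


if __name__ == "__main__":
    # Toy data, invented for this example only.
    predicted = {("1", 100, "A", "T"), ("1", 200, "C", "G"), ("2", 50, "G", "A")}
    validated_true = {("1", 100, "A", "T"), ("1", 200, "C", "G"), ("3", 10, "T", "C")}
    validated_false = {("2", 50, "G", "A"), ("4", 75, "C", "T")}

    counts = count_outcomes(predicted, validated_true, validated_false)
    print(counts)
    print(sensitivity(counts), specificity(counts), balanced_accuracy(counts))
```

Balanced accuracy averages the two rates, so an entry cannot score well simply by calling very few or very many sites when true and false candidates are unevenly represented in the validation set.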
