Will we know the TCGA labels of the samples used in the final challenge? That is, will actual TCGA labels be used to identify each of the samples, or will the samples be given aliases? A previous question on this forum asked whether in the final challenge we would know the cohort each sample came from (i.e. breast or ovarian), which was answered in the affirmative. How will this information be conveyed to us in the final data matrices that we receive? Thanks, Michal

Created by Michal Grzadkowski grzadkow
Hi Hannah, As the prospective testing data will be published in another study, we won't disclose the specifics of the testing data, including the sample labels, because of the embargo. We will, however, make sure the 'Gene_ID' and corresponding measurements in the testing data follow the same format as the training data. As for sample names, we will replace the TCGA/CPTAC barcodes with pseudonyms, such as 'Patient_1,2,3'. As I understand it, this setup won't affect your code for submission. We will keep every participant updated once the testing data is available, and detailed information will be provided. Thanks, Zhi
Hi Zhi, I'm confused by the statement "we won't provide you with specific TCGA labels". I think there is still some ambiguity to be resolved. I would expect that the training and testing datasets would have the same format. If TCGA sample labels are present in the training data, they should also be present in the testing data. Though _we_ won't see these datasets or the samples included within them, _our code_ will have access to this information. Correct? If the format (i.e. inclusion/exclusion of TCGA sample labels) differs between the testing and training datasets, please explain the reasoning behind this decision. Perhaps this would be best clarified with schemas that describe the specific format of the testing and training data and show the differences between them. Thanks, Hannah
No, each data type will be saved separately by cancer type (ovarian or breast). However, we won't provide you with specific TCGA labels.
So in the input file, there will be no identifier of the tissue type? Our models will need to be pan-tissue ready and not trained for a particular tissue type?
Yes, that is correct.
So if I understand correctly, the testing data will not have any TCGA labels for the sample IDs that our code could use, unlike the training data?
Hi, For training data, the sample IDs (TCGA labels) have already been given in each data matrix. For testing data, you won't be given such information for your prediction, as the testing data will be hidden from you. We will, however, provide a preview of the testing data format once the testing data is available. Each data type (RNA, CNA, proteome and phosphoproteome) will be divided and saved separately by cancer type. Once available, we will provide such information as we did for the training data.
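For anyone who wants to check that their submission code does not depend on the sample labels, here is a minimal sketch of reading a matrix while treating the column headers as opaque identifiers. It assumes a tab-separated layout with 'Gene_ID' as the first column and 'NA' for missing values; the delimiter, the missing-value marker, the function names and the example barcodes and gene names are all hypothetical and not confirmed by the organizers.

```python
import csv
import io


def read_matrix(handle, id_column="Gene_ID"):
    """Read a matrix with one row per gene and one column per sample.

    Assumed layout (not confirmed): tab-separated, first column is Gene_ID,
    missing measurements written as "NA" or left empty.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header[0] != id_column:
        raise ValueError(f"expected first column {id_column!r}, got {header[0]!r}")
    # Sample labels are kept exactly as given and never parsed, so TCGA
    # barcodes and pseudonyms such as Patient_1 are handled the same way.
    samples = header[1:]
    values = {}
    for row in reader:
        if not row:
            continue
        values[row[0]] = [None if x in ("", "NA") else float(x) for x in row[1:]]
    return samples, values


def column_for(samples, values, sample):
    """Return the gene -> measurement mapping for one sample label."""
    idx = samples.index(sample)
    return {gene: vals[idx] for gene, vals in values.items()}


# Made-up stand-ins for a training matrix (barcode-style labels) and a
# testing matrix (pseudonyms); real files would be opened from disk.
training = "Gene_ID\tTCGA-XX-0001\tTCGA-XX-0002\nGENE_A\t1.5\t2.0\nGENE_B\tNA\t0.5\n"
testing = "Gene_ID\tPatient_1\tPatient_2\nGENE_A\t0.1\t0.2\nGENE_B\t0.3\tNA\n"

for text in (training, testing):
    samples, values = read_matrix(io.StringIO(text))
    print(samples, sorted(values))
```

Because the labels are only used as lookup keys, the same code path runs on both matrices. Anything that splits or pattern-matches a barcode (for example, to infer tissue type) would break on the pseudonymized testing data.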
