In the webinar, it was mentioned that once we made a Docker submission, we would receive emails with log files produced by the submission. Will we have control over the content of these log files? If so, are there any restrictions on what we can put in these log files?
I should also point out that it would be difficult to formulate meaningful restrictions on the content of the log files. For example, in response to another question I posed in this forum, the TCGA barcodes of the samples in the testing cohorts will be hidden from us for Sub-Challenge 2. However, one could use the log file to output the TCGA barcode for each testing sample, as inferred using similarity of the given RNA and CNA profiles to publicly available TCGA data. Even if there were a specific rule against putting TCGA barcodes in the log files, one could encode the inferred barcodes into some innocuous-looking set of strings in the log files to be decoded by the participant receiving the log file emails.
Knowing which TCGA samples the testing data consist of would be very useful in improving the proteomic predictions, as for example one could tailor one's algorithm to incorporate sample-specific mutation, methylation, and other -omic datasets. Thus, more broadly, I would also like to know if the challenge organizers are trying to explicitly prohibit us from using any sample-specific information for the testing cohort, or trying to make it hard but not impossible for the purposes of making the challenge more interesting.
Thanks,
Mike
Created by Michal Grzadkowski grzadkow

Hi Mike,
I see some misunderstandings regarding the testing data. We don't have TCGA samples as the **testing** data; instead, we will use a prospective collection of CPTAC samples. We hide the IDs of the testing samples due to data embargo and privacy concerns, and each participant in the testing data will be renamed 'Participant_1' to 'Participant_N'. That being said, there are also no other publicly available datasets relevant to our testing data, as they are again prospective collections. Therefore, knowing the identity of the participants won't grant you any advantage. However, please feel free to use all available resources regarding the **training** samples within and beyond Synapse, such as external clinical or other omics data, to help you make better predictions in a more sample-specific manner. Please note that we will only ask you to provide the external resources used during the challenge if you rank at the top of the leaderboard.
Thanks,
Zhi