Protein samples for subchallenge 2

Hi, How can we select relevant samples in the CNA and RNA test data if we do not know the samples in the proteomics test data that we must predict? David

Created by David McDonald DavidMcDonald93
Thanks!
Dear Manuel, For example, the CNA testing data will have the same number of genes as the CNA training data (11859). The same applies to RNA and protein. The genes across the different omics platforms share some but not all entities. You need to build a model to predict all 7061 proteins in the testing proteome, even if some of the genes/proteins are not available at the CNA or RNA level. The question here is whether using information from other genes helps predict the abundance of a particular protein when the genetic information for that protein is unavailable. I hope this is clearer. Best, Zhi
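As a rough illustration of the partially overlapping gene sets described in this reply, the sketch below splits the target proteins into those that have a matching CNA and/or RNA feature and those that have neither. The gene symbols and the helper name `partition_targets` are made up for the example; the real identifiers come from the challenge files.

```python
def partition_targets(protein_ids, cna_ids, rna_ids):
    """Group target proteins by which omics layers contain the same gene."""
    cna, rna = set(cna_ids), set(rna_ids)
    groups = {"cna_and_rna": [], "cna_only": [], "rna_only": [], "neither": []}
    for gene in protein_ids:
        in_cna, in_rna = gene in cna, gene in rna
        if in_cna and in_rna:
            groups["cna_and_rna"].append(gene)
        elif in_cna:
            groups["cna_only"].append(gene)
        elif in_rna:
            groups["rna_only"].append(gene)
        else:
            groups["neither"].append(gene)
    return groups


# Toy identifiers (hypothetical, not taken from the challenge data).
proteins = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
cna_genes = ["GENE_A", "GENE_B", "GENE_X"]
rna_genes = ["GENE_A", "GENE_C", "GENE_Y"]

groups = partition_targets(proteins, cna_genes, rna_genes)
for name, genes in groups.items():
    print(name, genes)
```

Proteins that land in the `neither` group have no same-gene CNA or RNA feature, so any prediction for them would have to rely on features from other genes.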
I'm a bit confused by "the number of features (genes/proteins) for the three test matrices will be exactly the same as the training matrices." The training matrices do not have the same set of genes for RNA, CNV, and protein. Will the matrices (mRNA, CNV, and the prediction we are to provide) have exactly the same dimensions? I ask because the express lane data for subchallenge 2 has different dimensions:
wc -l express_lane/sc2/*
11860 express_lane/sc2/prospective_ova_CNA_median_sort_common_gene_11859.txt
7062 express_lane/sc2/prospective_ova_pro_gold_express.txt
15122 express_lane/sc2/prospective_ova_rna_seq_sort_common_gene_15121.txt
To summarize: **Will the test data be a consistent subset of all of those, or are we expected to predict proteins where we may have CNV but no RNA, for example?** Thanks,
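Each `wc -l` count quoted in this post is one more than the corresponding feature count (11859 and 15121 in the file names, 7061 proteins as mentioned elsewhere in this thread), which is consistent with one header line per file. A minimal sketch of that arithmetic, assuming a single header row and one feature per remaining line; `count_features` and the toy matrix are hypothetical:

```python
import io


def count_features(handle):
    """Return the number of data rows, assuming the first line is a header."""
    n_lines = sum(1 for _ in handle)
    return max(n_lines - 1, 0)


# Tiny in-memory stand-in for a tab-delimited matrix (hypothetical content).
toy = io.StringIO("gene\tS1\tS2\nGENE_A\t0.1\t0.2\nGENE_B\t-0.3\t0.4\n")
toy_features = count_features(toy)
print(toy_features)

# The same arithmetic applied to the line counts quoted in the post.
wc_counts = {"CNA": 11860, "protein": 7062, "RNA": 15122}
features = {name: n - 1 for name, n in wc_counts.items()}
print(features)
```

For a real file, the same function could be called on an open file handle instead of the in-memory string.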
Yes.
And this is the same for subchallenge 3 as well? David
Dear David, That is correct. Best, Zhi
Thank you again for your reply Zhi, So, just to confirm, there will be a fixed number of samples (20 for the first round) for the CNA, RNA and proteomic test data. The number of samples will be the same for all three test matrices and will be ordered the same way across all three data sets. Also, the number of features (genes/proteins) for the three test matrices will be exactly the same as the training matrices. David
Dear Ari, Yes, for the first leaderboard, we have 20 samples. But for the second and final rounds, we might have a different sample size. We did not reveal the sample size for the leaderboard because, in principle, it doesn't affect how you load the data. Please refer to the dryrun script above as an example model for subchallenge 2. Best, Zhi
Dear David, The testing data consist of independent samples rather than samples from the TCGA collection, so they don't have the 'TCGA' prefix. You also don't need to specify the sample names in your script. Please refer to the dryrun script here as a good example: https://github.com/Sage-Bionetworks/NCI-CPTAC-Challenge-Examples/blob/master/sc2/Dry_Run_SC2.R. Please let me know if you are still puzzled. Best, Zhi
Hi Zhi Li, Does this mean that in subchallenge 2 the prediction is made on a dataset where the sample size (N) = 20? I understood earlier that although the number of genes in the dataset is known, the sample size is unknown. Best, Ari
Thank you for your reply Zhi. The samples are labeled in the form TCGA-04-xxxx. So by 20 samples, do you mean 20 unique numbers after TCGA- (replacing 04)? Sorry, I am still a little confused, David
Dear David, There are 20 samples each for CNA, RNA and proteome. They are ordered the same way in each of the test data sets. You need to predict all of them. Best, Zhi
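Since the samples are said to be ordered the same way across the CNA, RNA and proteome matrices, a quick sanity check on any local copies could compare their header lines. The sketch below assumes tab-delimited files with genes as rows, samples as columns, and a row-label field in the first header position; the helper names and the toy data are hypothetical.

```python
import io


def sample_ids(handle, sep="\t"):
    """Read the header line and return the sample IDs (all fields but the first)."""
    header = handle.readline().rstrip("\r\n")
    return header.split(sep)[1:]


def same_sample_order(*handles):
    """True if every matrix lists the same samples in the same order."""
    ids = [sample_ids(h) for h in handles]
    return all(current == ids[0] for current in ids[1:])


# Toy matrices (hypothetical sample names and values).
cna_txt = "gene\tS1\tS2\tS3\nGENE_A\t0\t1\t0\n"
rna_txt = "gene\tS1\tS2\tS3\nGENE_B\t5.1\t4.8\t6.0\n"
shuffled_txt = "gene\tS1\tS3\tS2\nGENE_A\t0.2\t0.1\t-0.4\n"

ok = same_sample_order(io.StringIO(cna_txt), io.StringIO(rna_txt))
bad = same_sample_order(io.StringIO(cna_txt), io.StringIO(shuffled_txt))
print(ok, bad)
```

If the header of a given file has no row-label field, the `[1:]` slice would drop the first sample, so that assumption should be checked against the actual files.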
