Hi,
How can we select relevant samples in the CNA and RNA test data if we do not know the samples in the proteomics test data that we must predict?
David
Created by David McDonald (DavidMcDonald93).
Thanks!
Dear Manuel,
For example, CNA in the testing data will have the same number of genes as in the training data (11859). The same applies to RNA and protein. The genes across the different omic platforms share some but not all entities. You need to build a model to predict all 7061 proteins in the testing proteome, even if some of the genes/proteins are not available at the CNA or RNA level. The question here is whether using information from other genes helps predict the abundance of a particular protein when the genetic information for that protein is unavailable. I hope it is clearer this time.
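To make the partial overlap concrete, here is a minimal sketch (not part of the official example scripts) that groups the protein targets by whether the same gene also appears in the CNA and RNA matrices. The function name and the toy gene lists are hypothetical; with the real data you would pass in the gene identifiers read from each file.

```python
def split_targets_by_coverage(protein_genes, cna_genes, rna_genes):
    """Group protein targets by which input platforms also measure that gene."""
    cna, rna = set(cna_genes), set(rna_genes)
    groups = {"both": [], "cna_only": [], "rna_only": [], "neither": []}
    for gene in protein_genes:
        in_cna, in_rna = gene in cna, gene in rna
        if in_cna and in_rna:
            key = "both"
        elif in_cna:
            key = "cna_only"
        elif in_rna:
            key = "rna_only"
        else:
            key = "neither"
        groups[key].append(gene)
    return groups


# Toy gene lists (hypothetical, not the challenge data).
protein_genes = ["TP53", "BRCA1", "MYC", "EGFR"]
cna_genes = ["TP53", "BRCA1", "KRAS"]
rna_genes = ["TP53", "MYC", "KRAS"]

groups = split_targets_by_coverage(protein_genes, cna_genes, rna_genes)
for name, genes in groups.items():
    print(name, genes)
```

Targets in the "neither" group are the ones for which a model can only rely on information from other genes.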
Best,
Zhi
I'm a bit confused by "the number of features (genes/proteins) for the three test matrices will be exactly the same as the training matrices." The training matrices do not have the same set of genes for RNA, CNV, and proteins.
Will the matrices (mRNA, CNV, and the prediction we are to provide) have exactly the same dimensions? I ask because the express lane data for subchallenge 2 has different dimensions:
wc -l express_lane/sc2/*
11860 express_lane/sc2/prospective_ova_CNA_median_sort_common_gene_11859.txt
7062 express_lane/sc2/prospective_ova_pro_gold_express.txt
15122 express_lane/sc2/prospective_ova_rna_seq_sort_common_gene_15121.txt
To summarize: **Will the test data be a consistent subset of all of those, or are we expected to predict proteins for which we may have CNV but no RNA, for example?**
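Each line count above is one more than the gene count in the file name (11860 vs. 11859, 15122 vs. 15121), which is consistent with a single header row. A small sketch for checking the dimensions of such a matrix follows. It assumes a tab-delimited file with one header row whose first cell labels the gene ID column; that layout is an assumption, so adjust if the header omits that cell. `matrix_shape` is a hypothetical helper, and the toy input stands in for a real file handle.

```python
import csv
import io


def matrix_shape(handle, delimiter="\t"):
    """Return (n_features, n_samples) for a matrix with one header row
    and one leading ID column."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    n_features = sum(1 for row in reader if row)  # skip blank lines
    return n_features, len(header) - 1


# Toy matrix (hypothetical values): 2 genes x 3 samples.
toy = io.StringIO(
    "gene\ts1\ts2\ts3\n"
    "TP53\t0.1\t0.2\t0.3\n"
    "MYC\t1.0\t1.1\t1.2\n"
)
shape = matrix_shape(toy)
print(shape)  # (2, 3)
```

For a real file, pass `open(path, newline="")` instead of the `StringIO` object.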
Thanks,
Yes. And this is the same for subchallenge 3 as well?
David
Dear David,
That is correct.
Best,
Zhi
Thank you again for your reply, Zhi.
So, just to confirm, there will be a fixed number of samples (20 for the first round) for the CNA, RNA and proteomic test data. The number of samples will be the same for all three test matrices and will be ordered the same way across all three data sets. Also, the number of features (genes/proteins) for the three test matrices will be exactly the same as the training matrices.
David
Dear Ari,
Yes, for the first leaderboard, we have 20 samples. For the second and final rounds, however, the sample size might be different. We did not reveal the sample size for the leaderboard because, in principle, it doesn't affect how you load the data. Please refer to the dryrun script above as an example model for subchallenge 2.
Best,
Zhi
Dear David,
The testing data consist of independent samples rather than samples from the TCGA collection, so they don't have the 'TCGA' prefix. You also don't need to specify the sample names in your script. Please refer to the dryrun script here as a good example: https://github.com/Sage-Bionetworks/NCI-CPTAC-Challenge-Examples/blob/master/sc2/Dry_Run_SC2.R. Please let me know if you are still puzzled.
Best,
Zhi
Hi Zhi Li,
Does this mean that in sub challenge 2 the prediction is done to a dataset where the sample size (N) = 20? I understood earlier that although the number of genes in the dataset is known, the sample size is unknown?
Best,
Ari
Thank you for your reply, Zhi.
The samples are labeled in the form TCGA-04-xxxx. So by 20 samples, do you mean 20 unique numbers after TCGA- (replacing 04)?
Sorry, I am still a little confused,
David
Dear David,
There are 20 samples each for CNA, RNA and proteome. They are ordered the same way in each test data set. You need to predict all of them.
Best,
Zhi
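Since the samples are said to be ordered the same way in every test matrix, a script can work from the column order instead of hard-coding sample names. The sketch below is a hypothetical check, not taken from the dryrun script, that compares the sample headers of the input matrices before a prediction is written in that same order. The sample IDs are made up.

```python
def same_sample_order(*headers):
    """True if every header lists the same sample IDs in the same order."""
    first = list(headers[0])
    return all(list(h) == first for h in headers[1:])


# Made-up sample IDs; real headers would be read from the test files.
cna_samples = ["S01", "S02", "S03"]
rna_samples = ["S01", "S02", "S03"]
shuffled = ["S01", "S03", "S02"]

print(same_sample_order(cna_samples, rna_samples))
print(same_sample_order(cna_samples, rna_samples, shuffled))
```

If the check passes, the predicted proteome can reuse the CNA header as its own column header.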