Since the test data (CNV and transcriptomic data) contain different numbers of genes, what subset of proteins will be scored? Is the list of proteins the intersection or the union of the genes present in the CNV and transcriptomic datasets?

Created by Priyanka Chakraborty priyankaC
Dear Priyanka, There are missing values in the test data. Thanks, Zhi
I have another query about the test data. Are there missing values in the test data? I mean, are there genes/proteins in the test dataset that don't have expression values for a few samples?
Dear participants, We've updated the '3.3 - Accessing Data' page. Please find corresponding file names, directories and other information for your modeling there. Thanks, Zhi
We didn't take the common genes among all three data sets because, even though certain proteins don't have their own CNA/RNA measurements, we would like to know whether other genes' CNA/RNA information can help infer the abundance of these proteins. Therefore, please use additional features to infer the abundance of those proteins.
Of those **7061** genes, for **6021** of them we only have RNA-seq data and for **4835** of them we only have CNA data, which can be tolerated, but for **1040** of them we have neither an RNA-seq nor a CNA value available. For these **1040** genes, are we supposed to use additional features that we compile ourselves?
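Coverage counts like the ones in the question above can be checked with plain set operations. What follows is a minimal sketch in Python, assuming the gene identifiers from each testing file have already been loaded into sets. The function name `coverage_report` and the toy identifiers are hypothetical and are not part of the challenge data.

```python
def coverage_report(proteins, rna_genes, cna_genes):
    """Count how many target proteins have RNA and/or CNA features available."""
    proteins = set(proteins)
    rna = proteins & set(rna_genes)   # targets with an RNA measurement
    cna = proteins & set(cna_genes)   # targets with a CNA measurement
    return {
        "total": len(proteins),
        "with_rna": len(rna),
        "with_cna": len(cna),
        "with_both": len(rna & cna),
        "with_neither": len(proteins - rna - cna),
    }

# Toy identifiers for illustration only (not real challenge genes).
proteins = {"A", "B", "C", "D"}
rna_genes = {"A", "B", "X"}
cna_genes = {"B", "C", "Y"}

report = coverage_report(proteins, rna_genes, cna_genes)
print(report)
```

Reporting "with_both" and "with_neither" separately makes it explicit which targets can only be predicted from other genes' features.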
Dear Priyanka, We will update the details of the testing data very soon. Thank you so much for your interest. Best, Zhi
Dear Zhi, Thank you for the clarification. Could you please give a brief update on the location and filenames of the test data? I checked '3.3 - Accessing Data'; a folder for the test data is mentioned there, but I can't find the filenames and format of the test datasets. Best Regards, Priyanka
I see. Thank you for the clarification.
Dear Tommi and Ebrahim, Ebrahim is right about the testing data. For the testing data we will only keep the genes shared by both the testing and training sets. As a result, we gave you a shorter version of the training data on Synapse compared to what you would download from the public domain (CPTAC data portal/TCGA firehose). Therefore, for the ovarian collection, there are 11859 genes in the testing CNA, 15121 genes in the testing RNA, 7061 genes/proteins in the testing proteome, and 10057 phosphosites in the testing phosphoproteome. You will have NA values in the testing proteome and phosphoproteome, as in the training data. The purpose is to make your code easier to adapt to the testing data, without having to subset the overlapping genes on your own. We will update the '3.3 - Accessing Data' page before round 1 starts. You will know more about the dimensions of the testing data by then. Best, Zhi
@tsvali: Please **ignore** my earlier answer, and sorry for confusing you! I misread your question! I do not know what the case will be for the **testing data**!
Okay, so it will be the same as in the training data. Thank you for your answer!
For all 7061 genes in the testing data in subchallenge2, will we have both RNA and CNA data (excluding random NAs)?
Dear Alim, The breast testing data is still under production by the CPTAC community. As of now, we don't have any update on that. Therefore, for the first round of the leaderboard we won't ask you to predict breast data for the challenges. We will keep everyone posted on the '3.3 - Accessing Data' page once the breast testing data is available for the challenge. Best, Zhi
I understand that missing values for the 7061 proteins to be predicted are present in the ovarian training data and will be in the test data as well. What about the breast cancer data? Is the test data only about the ovarian data set? How many proteins are there? Should we assume that the same 7061 proteins will need to be predicted in the ovarian test data, knowing that the overlap of the breast and ovarian proteomics data sets consists of only 6970 proteins? Regards.
Hi Ebrahim, Actually, even better. We provided the updated training proteome, focusing on the 7061 proteins that need to be predicted in the testing data. Feel free to download it from '3.3 - Accessing Data'. That way you will definitely know which proteins you will be predicting. Best, Zhi
Thank you, Zhi, for your quick response! Have I correctly understood from your answer that we will **not have** the **names of the proteins** that need to be predicted in the **second sub-challenge**? And is this due to the fact that the **test data** has been held out? In other words, one should predict 7061 proteins out of the 7167 available proteins in the training data from the **ovarian data set** without knowing exactly which proteins? Thank you in advance for your response. Br, Ebrahim
Thank you, everyone, for your responses. Best, Priyanka
Dear Ebrahim, Thank you for your interest in this DREAM challenge! We only considered genes/proteins/phosphosites **overlapping** between the testing and training data with at least 5 observations in both collections. As a result, we ended up with 7061 proteins. Please try to predict all 7061 proteins in the training data. We didn't take the common genes among all three data sets because, even though certain proteins don't have their own CNA/RNA measurements, we would like to know whether other genes' CNA/RNA information can help infer the abundance of these proteins. Accordingly, we've revised '3.6 - Prediction Scoring Metrics' to reflect the update. Please let me know if it is clearer this time. Best, Zhi
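The selection rule described in this reply (shared between training and testing, with at least 5 observations in both collections) could be sketched as follows. This is an illustration only, not the organizers' actual pipeline. It assumes each collection is a dict mapping a protein identifier to its per-sample values, with `None` marking a missing value; the names `count_observed` and `shared_with_min_obs` and the toy values are hypothetical.

```python
def count_observed(values):
    """Number of non-missing entries (missing values are assumed to be None)."""
    return sum(v is not None for v in values)

def shared_with_min_obs(train, test, min_obs=5):
    """Identifiers present in both collections with >= min_obs observations in each."""
    shared = train.keys() & test.keys()
    return sorted(
        gene for gene in shared
        if count_observed(train[gene]) >= min_obs
        and count_observed(test[gene]) >= min_obs
    )

# Toy data for illustration only.
train = {
    "P1": [1.0, 1.1, 0.9, 1.2, 1.0, 0.8],
    "P2": [1.0, None, None, 2.0, None, None],  # only 2 observations
    "P3": [0.5, 0.4, 0.6, 0.5, 0.7],           # absent from test
}
test = {
    "P1": [0.1, 0.2, 0.3, 0.1, 0.2],
    "P2": [0.2, 0.2, 0.3, 0.1, 0.2, 0.4],
    "P4": [0.3, 0.3, 0.2, 0.1, 0.4],           # absent from train
}

kept = shared_with_min_obs(train, test)
print(kept)
```

If the data is read with a library that encodes missing values as NaN instead of `None`, `count_observed` would need to be adapted accordingly.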
Dear Mi, Would you please explain where this number (i.e. **7061**) comes from (has this been explained somewhere in the wiki or forum)? a. If I am not mistaken, "training data" refers to the **ovarian** data set. b. If point `a` is correct, then the numbers I am getting do not correspond to **7061**: I. The number of common proteins between the two laboratories (JHU and PNNL) is **7167**. II. The number of common proteins between the two laboratories (JHU and PNNL) with less than 30% **missing values** is **7063**. III. The number of common genes among the three data sets (i.e. RNA, CNA, and protein) is **4849**. This means that, when **testing**, there will be proteins for which we possibly do not have either or both of the CNA and RNA abundances. Is that correct? I would like to thank you in advance for your consideration and response. Br, Ebrahim
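One way counts like I and II above might be computed is sketched below. The post does not say how the 30% threshold was applied, so this sketch assumes it is checked separately in each laboratory's table, which may differ from what was actually done. Each table is assumed to be a dict mapping a protein identifier to per-sample values with `None` for missing entries; `missing_fraction`, `common_low_missing`, and the toy data are hypothetical.

```python
def missing_fraction(values):
    """Fraction of missing (None) entries; an empty row counts as fully missing."""
    if not values:
        return 1.0
    return sum(v is None for v in values) / len(values)

def common_low_missing(lab_a, lab_b, max_missing=0.3):
    """Proteins measured by both labs with < max_missing missing values in each."""
    common = lab_a.keys() & lab_b.keys()
    return sorted(
        p for p in common
        if missing_fraction(lab_a[p]) < max_missing
        and missing_fraction(lab_b[p]) < max_missing
    )

# Toy tables for illustration only (stand-ins for the JHU and PNNL proteomes).
jhu = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [1.0, None, None, 4.0],   # 50% missing
    "P3": [1.0, 2.0, 3.0, 4.0],     # not measured by the other lab
}
pnnl = {
    "P1": [1.0, 2.0, 3.0, None],    # 25% missing
    "P2": [1.0, 2.0, 3.0, 4.0],
    "P4": [1.0, 2.0, 3.0, 4.0],
}

common_all = sorted(jhu.keys() & pnnl.keys())
common_filtered = common_low_missing(jhu, pnnl)
print(len(common_all), len(common_filtered))
```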
Hi, The list of proteins to be scored is the same 7061 as in the training data. As for the input features, CNV (11859) and RNA (15121), it is up to you whether to use either of them or their intersection in your training phase. Best, Mi
