Hi,
In the training data, all data_obs_*.txt files are simulated from one file, data_true.txt, and they are all the same size. This means the training datasets are not independent of each other. I assume the test datasets will be independent of each other and may have different sizes (numbers of samples). Is that correct?
Regards,
-Han
Created by Han Hu hh1985
In the testing data, all of the data matrices (1 true and 100 obs) will have exactly the same column names and row names (and consequently the same dimensions).
My bad. I meant the protein ids and sample ids in the test data (data_test_obs_*.txt). Will they be consistent?
Yes, they are. Protein Id and sample Id are consistent across all data sets (1 true and 10 observed) in training.
Hi pacificma,
How about the protein ids and sample ids within data_obs_*.txt in the test data? Are protein id1 and sample id1 in data_obs_1.txt the same as those in data_obs_2.txt?
-Han
Sorry about the confusion. You should treat them as independent protein sets between testing and training, and the samples in the testing data are also independent of those in the training data.
Hi Zhi,
Thank you so much! Another question: will the protein ids be consistent across the training and test data? In other words, if I find that protein Id1 is correlated with protein Id2 and protein Id3 in the training data, can I use that information in the test data?
-Han
Hi Han,
The 100 testing datasets will be simulated from one ground truth file (data_test_true.txt) with pseudo missing patterns. All of the testing datasets will have the same dimensions (protein*sample), so you should consider them not independent of each other as well. However, the specific dimensions of the testing data will differ from those of the training data; this information will be released before the first round starts. Please refer to 3.3 - Accessing Data for a detailed explanation and any updates to the data.
Best,
Zhi
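
For anyone who wants to confirm the id consistency discussed above on their own copy of the data, here is a minimal sketch. It assumes (this is not confirmed anywhere in the thread) that each file is tab-delimited, that the first row holds a corner label followed by the sample ids, and that the first column of each later row holds a protein id. The function names read_ids and check_consistent_ids are made up for illustration.

```python
import csv
import glob


def read_ids(path, delimiter="\t"):
    """Return (protein_ids, sample_ids) read from one matrix file.

    Assumed layout: first row = corner label + sample ids,
    first column of every later row = protein id.
    """
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter=delimiter))
    if not rows:
        return [], []
    sample_ids = rows[0][1:]
    protein_ids = [row[0] for row in rows[1:] if row]
    return protein_ids, sample_ids


def check_consistent_ids(paths):
    """Return the paths whose ids differ from those of the first file."""
    paths = sorted(paths)
    if not paths:
        return []
    reference = read_ids(paths[0])
    return [p for p in paths[1:] if read_ids(p) != reference]


if __name__ == "__main__":
    # The file pattern follows the naming used in this thread.
    mismatched = check_consistent_ids(glob.glob("data_obs_*.txt"))
    print("files with different ids:", mismatched)
```

An empty list means every file shares the same protein ids and sample ids, in the same order, as the first file. If the files have no corner label in the header row, the slice in read_ids would need adjusting.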