Topic: Right way to train and test for Sub-challenge 1

Hi.
I have an unclear point about Sub-challenge 1.
As I understand it, we have all the true protein values in data_true.txt and partially observed protein values in 10 training data sets (data_obs_1~10.txt).
Should the missing protein values in each train_obs_*.txt file be predicted using only the other proteins in that same file?
We can see that the value of Protein_1 at Sample_2, which is missing in train_obs_2.txt, matches the observed Protein_1 value at Sample_2 in train_obs_3.txt.
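For anyone who wants to check this overlap themselves, here is a minimal pandas sketch; the tab-delimited layout with samples as rows and proteins as columns, and the Sample_2/Protein_1 labels, are assumptions about the file format:

```python
import pandas as pd

# Load two of the released training files. The tab-delimited layout with
# samples as rows and proteins as columns is an assumption about the format.
obs2 = pd.read_csv("train_obs_2.txt", sep="\t", index_col=0)
obs3 = pd.read_csv("train_obs_3.txt", sep="\t", index_col=0)

# An entry masked in one file can be observed in another, because every
# train_obs file masks the same underlying truth (data_true.txt) differently.
print(obs2.loc["Sample_2", "Protein_1"])  # expected: NaN (missing here)
print(obs3.loc["Sample_2", "Protein_1"])  # expected: the observed true value
```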
My point is: why do you plan to release so many test files (perhaps 100 test files, each containing 80 samples)?
I guess the reason is to simulate the missing-protein patterns of real experimental environments. Is that right?
Sorry about my poor understanding of the data set (and my grammar).
Many thanks in advance!
NRMSD and COR are summarized for each protein of each data set separately, and then we take the average of them across all proteins and data sets.

When testing for Sub-challenge 1, are we supposed to calculate the NRMSE across all simulated data sets for each protein, or for each protein within each simulated data set? In the former we would have 1 NRMSE, and in the latter we would get 10 NRMSEs (1 for each file).
The line "For each protein, we first calculate the Normalized Root Mean Square Error (NRMSE) between imputed values and underlying truths across all samples from all simulated data sets." is a bit ambiguous. The model can depend on the data itself, but you have to provide a single function (in other words, a systematic way to specify the model) that can be applied to each data set separately.
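To make the scoring concrete, here is a minimal sketch of the summary described above: one NRMSE per (data set, protein) pair, averaged across all pairs. Normalizing the RMSE by the standard deviation of the true values is an assumption; the organizers' exact normalization may differ:

```python
import numpy as np

def nrmse(truth, imputed):
    # RMSE normalized by the standard deviation of the true values.
    # The exact normalization used by the organizers is an assumption here.
    rmse = np.sqrt(np.mean((imputed - truth) ** 2))
    return rmse / np.std(truth)

def summarize(truth_sets, imputed_sets):
    # One NRMSE per (data set, protein) pair, then a single average over
    # all pairs, i.e. "summarized for each protein of each data set
    # separately, and then we take the average across all proteins and
    # data sets".
    scores = []
    for truth, imputed in zip(truth_sets, imputed_sets):
        for p in range(truth.shape[1]):  # proteins as columns
            scores.append(nrmse(truth[:, p], imputed[:, p]))
    return float(np.mean(scores))
```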
Please have a look at the discussions in the topics '**test data, sub 1 chanllenge**' and '**Data Description, Subchallenge 1**', and let me know if you have further questions.

I have a similar question. I don't understand very well what we have to do in this challenge. So far we have 2 different interpretations:
1) Using the data provided for training, we learn a model that does imputation. Then we use that learned model to impute the 100 data sets in the test set.
2) We create a function that receives as input one matrix with missing values. Using that matrix, we learn a model that performs imputation exclusively on that matrix. That means that at testing time we would have 100 models (since a new model is created each time the function is called).
Are we in scenario 1 or 2 (or a different one)?
In either case, can we assume that the 100 data sets in the test set come from the same distribution as the training data? Can we also assume that Protein_1 in the training set is the same Protein_1 in the test sets (and likewise for the rest of the proteins)?
Thanks!

You need to provide a function with one observed data set as the only input, which returns the corresponding imputation result.
Multiple training and testing data sets were released in order to help participants understand the missing mechanism and to evaluate the results in a fair way (avoiding the artificial effect of a single data set).
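In other words, scenario 2 above: any fitting happens inside the call, so each test file gets its own model. A minimal sketch of that interface, with a placeholder column-mean fill standing in for a real method (the function name and strategy are illustrative only):

```python
import numpy as np

def impute(observed):
    # Required interface: one observed data set (samples x proteins, NaN
    # for missing entries) is the only input; the imputed data set is the
    # output. Any model fitting happens inside this call, so each of the
    # 100 test files is handled independently.
    imputed = observed.copy()
    # Placeholder strategy: fill each missing entry with its protein's
    # (column) mean. A real submission would substitute its own method.
    col_means = np.nanmean(observed, axis=0)
    rows, cols = np.where(np.isnan(observed))
    imputed[rows, cols] = col_means[cols]
    return imputed
```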