I've been looking at the 'data_true.txt' file in the subchallenge one dataset and have noticed that there isn't a clear correlation between the number of zeros (missing values) per row and the mean row intensity. This conflicts with proteomics data that I've seen before, which tends to be left-censored because of dropout at the noise floor. That is, I would expect low-intensity proteins to have a higher likelihood of being missed, but I'm not seeing that here. I'm probably misunderstanding something about the dataset and am hoping someone could clarify.
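For anyone who wants to repeat this check, here is a minimal sketch. It assumes the matrix has already been parsed into a list of rows (one row per protein, one float per sample) and that a zero marks a missing value. The function names are hypothetical, and loading is left out because the file layout isn't described in this thread. It computes a Spearman rank correlation between the zero count per row and the mean of the non-zero intensities in that row.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def zeros_vs_intensity(rows):
    """Spearman correlation between zeros per row and mean non-zero intensity."""
    zero_counts, mean_intensity = [], []
    for row in rows:
        observed = [v for v in row if v != 0]
        if not observed:
            continue  # a row with no observed values has no mean intensity
        zero_counts.append(len(row) - len(observed))
        mean_intensity.append(sum(observed) / len(observed))
    return pearson(average_ranks(zero_counts), average_ranks(mean_intensity))


# Toy data only (not from the challenge): lower intensity, more zeros.
toy = [[10, 10, 10, 10], [5, 5, 5, 0], [2, 2, 0, 0], [1, 0, 0, 0]]
rho = zeros_vs_intensity(toy)
```

With left-censored data I would expect a clearly negative value here; a value near zero is what prompted the question.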
Thanks,
-Jeremy
Created by Jeremy Jacobsen jjacob_cub
Great. Thanks.
Missing values in the 10 training data sets follow the same mechanism, and are missing not at random (MNAR).
Hi Teng,
Based on what was discussed above, I think that the data_true.txt zeros are MCAR, whereas the testing data NaNs are NMAR. Please correct me if I'm wrong.
@pacificma
When producing the 10 test datasets, did you try to simulate MAR or NMAR, or a mixture of both?
i.e., is the missingness based on biological properties or completely at random?
yes, you should XD
A follow-up question: should we assume the test datasets were produced in **exactly** the same way?
The purpose of generating zeros is to match the missing pattern of the testing data, which contains a non-negligible number of zero observations. Result evaluation also includes assessing prediction accuracy on both zero and non-zero values.
Thank you for the clarification, Weiping. Now I think I understand how the data was produced, but I don't understand why zero values were added to the real data. It seems that any model would perform imputation better on the holdout if it ignored the zeros, since they aren't in the real data?
Data_true came from a subset of completely observed real data, and we simulated some zero values as part of the underlying truth.
I'm still confused. Is the data_true.txt data simulated, or is it a subset of real data chosen such that there is no correlation between intensity and number of missing values?
On the first part, you are right.
Second, the data is simulated based on a subset of real data. And if the cell you mentioned means the batch, there were no replicate measurements.
Weiping
Thank you for the reply, Weiping. Just to be clear, the zeros in the data_true.txt file are proteins that were biologically not present, and the NaN cells in the other files are simulated to be missing based on assumptions about mass spec dropout? Is the data_true.txt data real or simulated? If real, were there multiple replicate measurements that went into the intensity for each cell?
Thanks again,
-Jeremy
data_true.txt contains the underlying true values of all proteins in all samples. Where a zero exists, it is a true zero.
Missing spots in the 10 observed data sets consist of all zeros and some non-zeros.
The pattern of zeros does not depend on the intensity of the proteins, while non-zero missing values can depend on the intensities.
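To make the two-part mechanism concrete, here is a rough sketch of how an observed data set could be derived from the true values. It is an illustration only, not the organizers' actual procedure: the logistic drop-out curve, the `midpoint` and `steepness` parameters, and the function names are all assumptions, and intensities are assumed to be on a log-like scale. Every true zero becomes missing (NaN), and each non-zero value is dropped with a probability that falls as its intensity rises.

```python
import math
import random


def logistic(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mask_observed(true_rows, midpoint=5.0, steepness=1.0, seed=0):
    """Return a copy of true_rows with missing entries set to NaN.

    Assumed mechanism (hypothetical parameters):
      * every true zero is missing, regardless of intensity;
      * a non-zero value v is missing with probability
        logistic(-steepness * (v - midpoint)), so low values drop out more.
    """
    rng = random.Random(seed)
    observed = []
    for row in true_rows:
        out = []
        for v in row:
            if v == 0:
                out.append(float("nan"))
            else:
                p_missing = logistic(-steepness * (v - midpoint))
                out.append(float("nan") if rng.random() < p_missing else v)
        observed.append(out)
    return observed


# Toy values only: one very high and one very low intensity next to true zeros.
toy_observed = mask_observed([[0.0, 100.0], [0.0, -100.0]])
```

Under this kind of mechanism the zeros show no intensity dependence, while the remaining NaNs are concentrated among low-intensity values.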
Weiping