I am wondering about a possible limitation in the design of sub-challenge 1. The challenge description itself admits that “proteomic profiling data from mass spectrometry based experiments often contain a large number of missing values due to the dynamic nature of the mass spectrometry instruments,” yet the challenge seems to assume that the CPTAC breast cancer global proteomics data, from which the complete matrix is simulated, is not subject to this problem. If we simulate the complete matrix from the CPTAC data (using some simulation model that attempts to compensate for the possibility of missing data) and then simulate missingness on top of it, isn't our imputation model just learning to recreate whatever the original simulation model is? Is there something wrong with this reasoning, and is the simulation code available anywhere? An alternative would be to use real “truth” data, including any inherent missingness, and then add more missingness, training an imputation model to restore it; the goal would then be to restore real data.
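To make that alternative concrete, here is a minimal sketch (all function names are hypothetical): take a real intensity matrix with its inherent missingness, hide an extra fraction of the *observed* entries, and score an imputer only on those held-out entries, where the truth is actually known.

```python
import numpy as np

def mask_observed_entries(X, frac=0.1, seed=0):
    """Hide an extra `frac` of the observed entries of X (NaN = already missing).

    Returns the masked matrix and a Boolean mask of the held-out positions,
    so an imputer can be scored only where real measurements exist.
    """
    rng = np.random.default_rng(seed)
    observed_idx = np.argwhere(~np.isnan(X))           # entries with real values
    held_out = observed_idx[
        rng.choice(len(observed_idx), size=int(frac * len(observed_idx)), replace=False)
    ]
    mask = np.zeros_like(X, dtype=bool)
    mask[held_out[:, 0], held_out[:, 1]] = True
    X_masked = X.copy()
    X_masked[mask] = np.nan
    return X_masked, mask

def score(imputer, X, frac=0.1, seed=0):
    """RMSE of `imputer` on entries observed in X but hidden from it."""
    X_masked, mask = mask_observed_entries(X, frac, seed)
    X_hat = imputer(X_masked)                          # any function that fills NaNs
    return np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2))
```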
Created by Chris Anderson (@andechri)

I am confident this simulation was wrong. Even if the intent was to simulate "true zero events that were not independent," it was not done properly enough to be reflected in the data. From reading this and other threads, I believe many of the participants share this view.
I think Jeremy's suggestion is very thoughtful, along with several other insightful analyses brought up by other participants. But since both the ONNL and PLM data have already been released, we have lost the opportunity to do this meaningful exercise. If three sub-challenges are deemed necessary, I suggest the third one use only genetic data (CNV) to predict either phosphorylation or proteomics. That would still be a meaningful exercise.

It seems like the proper way to prepare the true data would be to build a ground truth from multiple replicates (more than are taken in a usual experiment) and thereby fill in data that is otherwise missing due to stochastic machine sampling; a rough sketch of this construction follows below. Any proteins missing due to preparation (digest-specific) should be removed from consideration. What would still be missing are proteins that are actually biologically absent. The training data would then be single-replicate subsets of the original. This would remove the need for simulation, ensuring the usefulness of the resulting models.

Thank you, Pei Wang, for the explanations.
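A rough sketch of that replicate-based construction, assuming a 3-D array `reps` of shape (n_replicates, n_proteins, n_samples) with NaN marking values missing in a given run (the array name and threshold are hypothetical):

```python
import numpy as np

def replicate_truth(reps, min_seen=3):
    """Build a ground-truth matrix from repeated runs of the same samples.

    reps: array (n_replicates, n_proteins, n_samples), NaN = missing in that run.
    A (protein, sample) value is trusted only if observed in >= min_seen runs;
    everything else stays NaN (treated as preparation- or biology-driven absence).
    """
    seen = np.sum(~np.isnan(reps), axis=0)    # how often each entry was measured
    truth = np.nanmean(reps, axis=0)          # average over the runs where observed
    truth[seen < min_seen] = np.nan           # not reproducible -> not ground truth
    return truth

# Training examples would then be the individual replicates: reps[0], reps[1], ...
```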
However, as mentioned in my previous post (https://www.synapse.org/#!Synapse:syn8228304/discussion/threadId=2217), our imputation did find a huge difference in NRMSE between the "true 0" incidences and the "abundance dependent dynamic missing" events. The imputed values at the "true 0" incidences were positive and seemed meaningful. We therefore hypothesized that the differing NRMSEs were due to an incomplete account of the real data patterns in your simulation. We understand that a complete account is itself challenging.
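For reference, the stratified scoring described above might look roughly like this (a sketch; the challenge's exact NRMSE normalization isn't stated here, so normalizing by the standard deviation of the true values is an assumption):

```python
import numpy as np

def nrmse(pred, truth, mask):
    """RMSE over the masked entries, normalized by the std of the true values.

    (The normalizer is an assumption; range-based normalization is also common.)
    """
    err = pred[mask] - truth[mask]
    return np.sqrt(np.mean(err ** 2)) / np.std(truth[mask])

# Scored separately per missingness class:
# nrmse(imputed, truth, true_zero_mask) vs nrmse(imputed, truth, dynamic_missing_mask)
```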
This hypothesis, that the simulation under-modeled the real data patterns, could be tested if the original values of the "true 0" incidences were provided. We are therefore wondering whether there is any plan to release the original full data, or the simulation method in detail, at any stage during or after the challenge?

I can only say that true zero events were not independent.

> both (correlated) "true 0" incidences
Can you please explain this? When I binarized the truth matrix into 1s and 0s, there was absolutely zero correlation either horizontally or vertically (gene-wise and sample-wise); a sketch of this check is included below.
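For what it's worth, the check described above is roughly the following (a sketch with a random stand-in matrix; one would substitute the actual sub-challenge 1 training matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 80))             # stand-in for the training matrix
X[rng.random(X.shape) < 0.2] = np.nan      # stand-in (independent) missingness

B = (~np.isnan(X)).astype(float)           # binarize: 1 = observed, 0 = missing

# Missingness correlation between genes (rows) and between samples (columns).
# Under a "correlated true 0" model these should not all be near zero.
gene_corr = np.corrcoef(B)
sample_corr = np.corrcoef(B.T)
off_diag = gene_corr[np.triu_indices_from(gene_corr, k=1)]
print("mean |gene-wise missingness correlation|:", np.nanmean(np.abs(off_diag)))
```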
Thanks Chris, Yuanfang, and Abhinav for your interest in sub-Challenge 1!

1. Since missing-value imputation is an unsupervised learning problem, we cannot appropriately evaluate the performance of competing methods without knowledge of the ground truth. Therefore, we decided to set up a "decoy system" to facilitate evaluation in this sub-challenge.
2. The missing values in the training sets are simulated from real data sets. The simulation models resulted from careful study of the missingness patterns in real proteomics data. Multiple layers of probability models are used to generate both (correlated) "true 0" incidences and "abundance dependent dynamic missing" events. In our opinion, these models provide a close approximation to the patterns of the real data sets. (A toy illustration of such a two-layer model is sketched after this post.)
3. Of course, in real practice we want to apply imputation methods to real data to facilitate scientific analyses. Participants are more than welcome to apply their methods to the original CPTAC breast and ovarian cancer proteomics data sets provided in this Challenge (sub-challenges 2 and 3) and investigate the scientific benefit of their imputation strategies. But due to the lack of direct, objective metrics for assessing the performance of imputation methods on real data, we chose not to go down this route when scoring participating teams.
Hope these help a little.
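To make item 2 above concrete, here is a toy illustration of what a two-layer missingness model could look like. This is purely a sketch, not the organizers' actual (unreleased) simulation: one layer draws correlated "true 0" events via a shared per-gene propensity, and the other drops low-abundance values with higher probability through a logistic model. All parameter values are made up.

```python
import numpy as np

def simulate_missing(X, seed=0, zero_rate=0.05, a=-1.0, b=1.5):
    """Toy two-layer missingness model (NOT the organizers' actual code).

    Layer 1: correlated "true 0" events; a latent per-gene propensity makes
             zeros cluster within genes rather than occur independently.
    Layer 2: abundance-dependent dynamic missing; lower-abundance values
             are dropped with higher probability via a logistic model.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = X.shape

    # Layer 1: a per-gene propensity shared across samples induces correlation
    gene_propensity = rng.beta(0.5, 0.5, size=n_genes) * 2 * zero_rate
    true_zero = rng.random(X.shape) < gene_propensity[:, None]

    # Layer 2: P(missing) = sigmoid(a - b * standardized abundance)
    z = (X - np.nanmean(X)) / np.nanstd(X)
    p_dynamic = 1.0 / (1.0 + np.exp(-(a - b * z)))
    dynamic = rng.random(X.shape) < p_dynamic

    X_obs = X.copy()
    X_obs[true_zero | dynamic] = np.nan
    return X_obs, true_zero, dynamic
```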
Thanks @yuanfang.guan! @andechri and I are excited about this subchallenge, and we'd really like to hear the organizers' thoughts on this issue and related ones. Is something wrong with what @andechri is describing above? Are you the best person to discuss this with, @thomas.yu?

I second this question.
From observing the data in sub-challenge 1, it doesn't seem right/realistic. But without seeing the simulation code or more simulated data (say, 50 simulated sets), it is difficult to tell where it went wrong. As it stands, we will simply see which algorithm happens to best restore a potentially flawed simulation. That doesn't deliver any biological meaning to me, but it is still a fun exercise.
Would the organizers be willing to share the simulation code or more simulated data?
Aren't we just learning the simulation model in sub-challenge 1?