For SC1, we assume the peptide does not exist at the 0 spots in the true intensity data set.
Hi Pei,
Again, thanks for the feedback. Yes, it is important to clarify this issue in the challenge description, and it would not hurt if the deadlines were postponed because of it.
However, bear with me for a second while I try to clarify what the purpose of this subchallenge is.
The purpose is to perform imputation without any supervision, meaning we have to fill in these blank values with our own estimates. Right? However, zeros account for about one third of the missing values (as we have seen in the example datasets), and, as you know, the error from mis-estimating zeros is by far the largest component of the problem. Furthermore, we have no information about protein identities, so it is impossible to use prior knowledge to estimate anything. And in the test sets no zeros will be present, by your own admission. This said, how then can ANY algorithm on earth predict zeros? Any imputation algorithm will try to fill in the missing values from the available information, yet no zeros are present and no prior knowledge can be used. An algorithm will be as likely to predict a zero as a 100 or a -1000!
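As a rough sanity check of the "about one third" figure, here is a minimal R sketch; it assumes the same data_true.txt and data_obs_7.txt example files used in the script further down this thread.
```
#fraction of the missing (NA) entries whose true value is zero;
#example files only, the real proportion may vary per data set
trt  <- as.matrix(read.table("data_true.txt", header = TRUE, row.names = 1))
obs1 <- as.matrix(read.table("data_obs_7.txt", header = TRUE, row.names = 1, sep = "\t"))

na_in_obs    <- is.na(obs1)
zero_in_true <- trt == 0

sum(zero_in_true & na_in_obs) / sum(na_in_obs)
```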
As I told you, the problem you are proposing has no correspondence to a real-world problem. To put it another way, it is as if you have a storehouse full of closed boxes. You open 1000 of those boxes, find only tomatoes in each, and then say to the modeler: "in this storehouse there are also boxes of oranges; now use your tomato information to identify the boxes of oranges"!
Please think about this carefully and let us know if this is in fact the nature of the subchallenge.
best
--Andre

Hi, Andre,
Apologies for the confusion.
Sub-challenge 1 is set up as an imputation problem, which is an un-supervised learning problem, i.e. there is no "training" data to guide the modeling... For example, one commonly encountered application of this type is image completion, where people want to figure out the missing part of an image by studying the image itself (without seeking "training" images, which could be quite tricky). Imputation in omics data sets is a newly emerging problem... Commonly used methods for imputation, like KNN, missForest, and low-rank matrix completion, are all designed to solve this "un-supervised" learning problem, and the data matrix with NAs is the only information source.
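For concreteness, here is a minimal R sketch of the kind of matrix-only imputation Pei mentions; it assumes the Bioconductor impute package and the CRAN missForest package are installed, and it uses data_obs_7.txt purely as an example input.
```
#one observation matrix (rows = peptides, columns = samples) containing NAs
obs <- as.matrix(read.table("data_obs_7.txt", header = TRUE, row.names = 1, sep = "\t"))

#KNN imputation (Bioconductor 'impute' package): each NA is filled from the
#k most similar rows, using only the matrix itself
library(impute)
knn_filled <- impute.knn(obs, k = 10)$data

#random-forest imputation (CRAN 'missForest' package): iteratively predicts
#each column from the others, again using only the matrix itself
library(missForest)
rf_filled <- missForest(as.data.frame(obs))$ximp
```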
As with many other "un-supervised" learning problems, performance evaluation of imputation is usually done on decoy data sets. Thus, we provide the training sets so that participants can easily evaluate their methods' performance. We don't expect participants to learn "biological rules" from the training data (the relationship between the training data and the test data sets is like totally different images of the same category).
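Concretely, a decoy-based evaluation can be sketched in a few lines of R: hide a random subset of the observed entries, impute, and score against the hidden values. The file name and the RMSE metric below are illustrative assumptions; the challenge's own scoring may differ.
```
set.seed(1)
obs <- as.matrix(read.table("data_obs_7.txt", header = TRUE, row.names = 1, sep = "\t"))

#hide 10% of the observed entries to create a decoy data set
observed_idx <- which(!is.na(obs))
hidden_idx   <- sample(observed_idx, round(0.1 * length(observed_idx)))
decoy <- obs
decoy[hidden_idx] <- NA

#plug in any imputation method here; column means are only a placeholder
imputed <- decoy
for (j in seq_len(ncol(imputed))) {
  miss <- is.na(imputed[, j])
  imputed[miss, j] <- mean(imputed[, j], na.rm = TRUE)
}

#score the imputation on the entries that were hidden on purpose
sqrt(mean((imputed[hidden_idx] - obs[hidden_idx])^2))
```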
Again, we apologize for the confusion here. We will add the related clarification to the challenge website.
Best,
-Pei

Hi Pei,
Thank you for your input, but I find your answer perplexing. In the first place, the fact that protein IDs do not match between the training set and the testing sets is a total game changer, and the worst thing is that **it is not mentioned anywhere in the challenge description**. This is very serious, and it is unfortunate that it only becomes known two days before the leaderboard submissions start. Many people here are spending a lot of resources on this challenge, only to learn something that probably undermines all their previous efforts.
Secondly, this little new bit of information makes the challenge almost impossible. We are given a training set where not only are the new cases absent but the variables are also meaningless for the testing set. So, for instance, in our modeling effort we may find that knowing the values of proteins X and Y allows us to estimate the values of protein Z. This information is useless, because these proteins may map to totally different entities in the testing set!
The idea of the challenge is interesting, but it is totally unrealistic and does not by any stretch of the imagination map to a real problem. Even if we are dealing with different situations, diseases, and test cases, each protein keeps its identity no matter where it is measured. Knowing the protein IDs would naturally be useful but, if that is not possible, at least they should keep their identities across the training and testing sets.
best,
--Andre

Hi, A,
The protein IDs in the training set do not map to the protein IDs in the testing set.
In practice, we often need to analyze data from a different disease, tissue type, or even species, where the "biological" knowledge we gained from previous/existing data sets will not help... So, in this sub-challenge, participants are expected to build an imputation function that does not rely on any previous "biological" knowledge (derived from existing/training data).
Best,
-Pei

Hi all,
We understand that predicting the true zeros in subchallenge 1, besides the missing values, is the real-life challenge of interest. However, predicting true zeros would likely be more efficient using the non-missing protein identifications, which are also part of real-life situations.
Or should we understand that the challenge test data will use the same Protein_1..7927 IDs as in the training data set?
Best,
A
Hi, All,
It looks like the missing pattern here is for a mass spectrometer that does not have any censored missing values, only technical or biological missing values: an NA in a peptide intensity could be either a technical or a biological missing value. I agree that zero is not the same as NA; in the data from the lab I used, zero is not a missing value but a censored one, meaning the value is below the detection limit rather than missing. We have prepared a paper on coping with censored and missing values, and there are a few papers discussing detection limits and missingness too. The missingness really depends on the devices used, and some missingness is associated with the experimental variables as well.
From the answers given by Thomas and Pei, I think the challenge data set treats censored values, biological missing values, and technical missing values the same, as there is no auxiliary information for the NAs or zeros in each obs file.
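For comparison with the censored-value setting Irene describes, a common simple treatment for left-censored values (below the detection limit) is substitution by half of the smallest observed value in each sample. This is only an illustrative R sketch, not anything the challenge prescribes, and it assumes the same data_obs_7.txt example file used later in the thread.
```
obs <- as.matrix(read.table("data_obs_7.txt", header = TRUE, row.names = 1, sep = "\t"))

#half-minimum substitution per sample (column): a crude stand-in for values
#assumed to lie below the detection limit. Illustration only; the challenge
#provides no auxiliary information to distinguish censored from missing.
half_min <- apply(obs, 2, function(x) min(x, na.rm = TRUE) / 2)
filled   <- obs
for (j in seq_len(ncol(obs))) {
  filled[is.na(obs[, j]), j] <- half_min[j]
}
```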
Irene
Hi, Pei,
Thank you for your explanation and patience with us.
In any case, since so many participants have raised questions and suspicions about the sub 1 data, I apologize, but I don't understand why the organizers couldn't release the simulation code. If it is right, what is the problem with releasing it? If it is wrong, then it makes far more sense to let the participants debug it for you.
Furthermore, I don't think we should expect that it is correct. I know I have been irritating by repeatedly saying so. But if it were even possibly correct or remotely acceptable, I would rather not say anything.
One reason is based on all of our observations, unless you are telling us that all these 10 teams are just operating at a much lower intelligence level than whoever made the simulation. That probability is low.
Can I ask whether you blindly rewrote the code a second time before confirming that it is correct and letting all of us go ahead?
Did you even take a look at the code when you answered these questions?
If the answer is no, can you please tell me what makes you confident that we are all blind and that a single student's one-time writing of the code was correct? We deeply appreciate your answering questions on the forum, which is far better than most PIs, who never appear until it is time to sign the authorship. But we would all prefer more informative answers.
The other reason is that the sub2 and sub3 data have already been changed more than 10 times, and those are simply cutoffs of existing data. Sub1, on the other hand, requires simulation, which is much more complicated, and its data has been there, untouched, forever. Unless you are telling us that sub1 was done by a different, much smarter and more professional student than whoever did sub2 and sub3, then, by maximum-likelihood expectation, sub1 is not going to be correct.
Thank you very much for explaining this to us, and thank you in advance for your patience.
Yuanfang

Well, that's the challenge of this problem. In real experiments, we never know whether an NA is a true zero or a technical missing value. We hope that participants can come up with smart ways to better address this.

The answer given in the webinar leaves the most important thing open. The problem with zeros is that, when we look at the observation files, ALL the zeros are NAs in all the files.
Why is this important? Because this way we have no way of building models capable of predicting zeros, which is by far the largest source of error. If our goal is to predict the NAs from the observations and no zeros are present to train the models on, the problem is nearly impossible. It is as if we want to predict which people have cancer and we have a population of positives and negatives split into training and testing sets, yet all positives (people with cancer) appear only in the testing set. No matter how good our models are, it will be impossible to predict anything because the training set does not contain a single positive!
Another way of putting the same question is: why are there no zeros in the observation files?
I made this little R script to try to prove my point. Although I use data_obs_7.txt here, the same thing occurs in every file.
Thanks for your help
--Andre
```
#first get the truth
trt <- read.table("data_true.txt", header = TRUE, row.names = 1)
trt[1:5, 1:5]

#mark which true values are zeros
tzeros <- 1 * (as.matrix(trt) == 0)
tzeros[1:5, 1:5]

#read the observations
obs1 <- read.csv("data_obs_7.txt", row.names = 1, sep = "\t")
obs1[1:5, 1:5]

#mark which observed values are NAs
oNAs <- 1 * is.na(obs1)
1 - oNAs[1:5, 1:5]

#are there any zeros in the truth that are NOT NAs in the observations?
#if so, then sum(tzeros * (1 - oNAs)) != 0
sum(tzeros * (1 - oNAs))
#this prints out a zero, therefore all truth zeros are NAs in the observations
```