Dear organizers,
My submission (to subchallenge 2) was scored with a Pearson correlation of "nan" but a numerical nrmse; please see below:
###################################################################
Your submission "ffs1" (ID: 9642722) to the Proteogenomics Subchallenge 2 has been scored:
You submission was scored.
corr: nan
nrmse: 2.0195
###################################################################
The usual reasons for a Pearson correlation of "nan" are:
1) a "nan" value in one of the predictions
2) variance of predictions for one of the proteins being 0
Reason 1) does not apply, because nrmse is numerical (a "nan" prediction would normally make nrmse "nan" as well).
Reason 2) does not seem to apply either: I check the variance before writing out the predictions, and the minimal variance is on the order of 10^(-11).
What can be the reason for correlation being nan in my submission? Is there a threshold for the "minimal allowed variance"?
With best regards,
Jan
Created by Jan Kaczmarczyk (jan.kaczmarczyk)
Thank you!
Best,
Jan
Dear Jan, Yuanfang,
You are right. Assigning corr=0 when either variance is 0 is the most efficient way to handle this type of problem.
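A minimal sketch of the convention described here, assuming a plain-Python helper. The name `pearson_or_zero` is hypothetical and is not taken from the actual challenge scoring script; it only illustrates returning 0 instead of "nan" when either input is constant.

```python
import math


def pearson_or_zero(pred, obs):
    """Pearson correlation that returns 0.0 when either input is constant.

    Hypothetical helper for illustration; not the challenge's scoring code.
    """
    if len(pred) != len(obs) or len(pred) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    # A constant input has zero variance, so the correlation would be 0/0.
    # Comparing min and max avoids rounding issues in a computed variance.
    if min(pred) == max(pred) or min(obs) == max(obs):
        return 0.0
    mean_p = sum(pred) / len(pred)
    mean_o = sum(obs) / len(obs)
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    ss_p = sum((p - mean_p) ** 2 for p in pred)
    ss_o = sum((o - mean_o) ** 2 for o in obs)
    return cov / math.sqrt(ss_p * ss_o)


# Constant predictions are scored as 0 rather than nan.
print(pearson_or_zero([1.0, 1.0, 1.0], [0.2, 0.5, 0.9]))
# Non-constant inputs are scored as usual.
print(pearson_or_zero([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

With this convention a constant prediction is penalized with a score of 0 instead of invalidating the whole submission.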
I changed the scoring script. Tom will update this tomorrow morning (Seattle time).
best,
Mi
Dear Mi,
Thank you for your help with this.
Do I understand correctly that the scripts are validated on all test samples but scored on a subset of test samples? If this is the case then indeed modifying all predictions is the only option.
On the other hand, adding noise to the predictions for every sample, or replacing them with mRNA, is a cumbersome procedure, and I agree with Yuanfang Guan that it would be much better to redefine the metric such that the correlation is zero when the variance is zero.
Best,
Jan
Dear Jan,
I checked your prediction file. It is indeed the problem I suspected. For a certain protein, 19 of the 20 samples have the same value and only 1 has a slightly different value, so the file passes the validation step. In the scoring step, however, the one patient for whom you have a different value is a missing value in our test data, so that sample is excluded. Your prediction is therefore scored on the remaining 19 or fewer identical values.
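The situation described above can be reproduced with a short sketch. All numbers are made up, and the `pearson` function here is a plain textbook implementation written for illustration, not the challenge's scoring script.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation; returns nan when either input is constant."""
    if min(x) == max(x) or min(y) == max(y):
        return float("nan")  # zero variance: the ratio would be 0/0
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Made-up predictions: 19 identical values and one slightly different value.
pred = [0.5] * 19 + [0.50001]
# Made-up observations; None marks the sample that is missing in the test data.
obs = [0.1 * i for i in range(19)] + [None]

# Validation looks at all 20 predictions, which are not constant.
passes_validation = min(pred) != max(pred)

# Scoring keeps only the samples that were actually observed.
kept = [(p, o) for p, o in zip(pred, obs) if o is not None]
pred_kept = [p for p, _ in kept]
obs_kept = [o for _, o in kept]

corr = pearson(pred_kept, obs_kept)
print(passes_validation, len(pred_kept), corr)
```

The predictions pass a variance check on all 20 samples, yet the 19 values that remain after dropping the missing sample are identical, so the correlation is undefined.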
Two solutions: either you generate some noise in every sample, or you replace the prediction with the corresponding RNA or something else.
PS: Maybe setting corr=0 when variance=0 is not such a bad idea. We'll discuss it and let you know.
Best,
Mi
A variance of zero should give a correlation of zero.
That said, the Docker environment makes it hard to detect where things went wrong: even if the code itself has no problem, other things, such as how the data is fed in and what is being fed in, can lead to such problems.
Dear Mi,
Thank you for your feedback. In my code I actually check for situations when the variance of predictions is exactly zero and then I "add noise" manually to these predictions. As a result, the minimal variance in my predictions file is on the order of 10^(-11), i.e. the standard deviation is on the order of 10^(-5) (I have this information in the log files). Therefore, the Pearson correlation should be numeric, unless there is a manually-added threshold for minimal variance in the scoring script.
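A variance check on the full prediction vector cannot tell how the predictions behave on a subset of samples. One possible stricter check, sketched below with a hypothetical helper name and made-up values, counts how many samples differ from the most frequent predicted value. If all of those samples happen to be missing from the test data, the remaining predictions are constant.

```python
from collections import Counter


def samples_differing_from_mode(values):
    """Number of values that differ from the most frequent value.

    If every one of these samples is missing from the test data, the
    remaining predictions are constant and the correlation is undefined.
    """
    most_common_count = Counter(values).most_common(1)[0][1]
    return len(values) - most_common_count


# Made-up predictions where only a single sample deviates from the rest.
fragile = [0.5] * 19 + [0.50001]
# Made-up predictions with a distinct value for every sample.
robust = [0.5 + 1e-5 * i for i in range(20)]

print(samples_differing_from_mode(fragile))
print(samples_differing_from_mode(robust))
```

A small count signals that a few missing observations are enough to make the scored predictions constant, even though the variance over all samples is nonzero.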
Best,
Jan
PS: The same code went through the Express Lane without problems, whereas with my previous scripts, when the variance was (exactly) 0, the Express Lane status was INVALID.
Your evaluation code has a bug. In that case, cor should be zero.
Dear Jan,
I think it's reason 2. Your predictions might not have zero variance overall, but the test data has only 20 samples. For a certain protein, there may be fewer than 20 observed samples (because there are many missing values), and the variance of your predictions on this subset might be zero.
When the variance is very small, as you can see in my dry run, I simply replace the prediction with mRNA (I know it's quick and dirty...).
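The quick fallback mentioned above might look roughly like the following sketch. The function name, the `min_variance` threshold of 1e-8, and all values are assumptions made for illustration; the dry-run code itself is not shown in this thread.

```python
import statistics


def replace_low_variance(pred, mrna, min_variance=1e-8):
    """Return the mRNA values when the predictions are (almost) constant.

    Hypothetical helper; `min_variance` is an arbitrary illustrative threshold.
    """
    if len(pred) != len(mrna):
        raise ValueError("pred and mrna must cover the same samples")
    if statistics.pvariance(pred) < min_variance:
        return list(mrna)  # fall back to the mRNA measurements
    return list(pred)


# Made-up values for one protein across 20 samples.
flat_pred = [0.5] * 19 + [0.50001]
varied_pred = [0.1 * i for i in range(20)]
mrna = [0.3 * i - 1.0 for i in range(20)]

print(replace_low_variance(flat_pred, mrna) == mrna)
print(replace_low_variance(varied_pred, mrna) == varied_pred)
```

This only helps if the mRNA values themselves vary across the observed samples, which the sketch does not check.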
Best,
Mi