> Note, in the previous leaderboard runs, since the sample size of the test data is extremely small (n=20), only proteins/phosphosites without any missing values were used for evaluations. In the final run, the sample size is larger (n=62 for ovarian and n=108 for breast cancer), so a more relaxed cutoff based on missing rate is implemented.
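For reference, the rule described in this note amounts to filtering each protein/phosphosite by its fraction of missing test samples. Below is a minimal sketch of what such a filter could look like; the data layout (proteins/phosphosites as rows, test samples as columns, NaN marking missing values) and the function name are assumptions of this illustration, not the organizers' actual scoring code.

```python
# Minimal sketch of a missing-rate filter (assumed layout: proteins/phosphosites
# as rows, test samples as columns, NaN for missing values). Not the challenge's
# real scoring code.
import pandas as pd

def filter_by_missing_rate(abundance: pd.DataFrame, max_missing: float) -> pd.DataFrame:
    """Keep rows whose fraction of missing test samples is below max_missing."""
    missing_rate = abundance.isna().mean(axis=1)  # fraction of NaN per protein/site
    return abundance.loc[missing_rate < max_missing]

# Previous leaderboard rounds (per the note above): no missing values allowed,
# equivalent to test_abundance.dropna(axis=0).
# Final run: relaxed cutoff, keep rows with a missing rate below 30%:
# final_subset = filter_by_missing_rate(test_abundance, max_missing=0.30)
```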
@MI_YANG @thomas.yu
Throughout all previous communications, we were told we would be evaluated on those with >5 examples in the test set in round 2, so I thought the evaluation scores made sense. Today is the first time I learned that only proteins/sites without missing values were used. To be frank, that tells me there is something seriously wrong with your sub3 online data and/or evaluation, which we are unable to debug because of the Docker setup, etc.
But can you please do us a favor and re-evaluate round 1 and round 2 using your final code? There are strong inconsistencies between what we observe in our experiments and what we see online. We need to know whether it is just an error in the evaluation in the previous rounds, or an error in the online data itself.
Thank you.
>Sub2 & Sub3: Proteins/phosphosites with missing rate <30% will be considered for evaluation in both the ovarian and breast data sets. The cutoff of 30% missing rate is chosen based on the following factors:
So the sub2 data is also going to change completely? Can I ask what the problem is here? Can I ask who, which team, asked you to do this? Even yesterday I was clearly told that sub2 stays the same and only sub3 changes.
'Dear Hongyang,
The scoring metric for sc2 is at least 5 observations. It's really the exact data we provided in the filtered and the evaluation_data folder.
Sorry for the confusion.
Best,
Mi'
Can you explain to me what happened to cause another sudden change overnight?
Any scoring change will change the algorithm completely. I strongly suggest that, in the future, your organizing team include at least one person who has participated in a challenge before.
Created by Yuanfang Guan 小黄药师 (yuanfang.guan):
Thank you very much for your feedback, Mi, but I need to analyze this paragraph:
>The algorithm should not be designed to fit the scoring metric in any way.
Educate me: what should an algorithm fit to?
> The focus of subchallenges sc2 and sc3 is to predict protein abundance and phosphorylation abundance. We do not see how reducing the subset of phosphosites/proteins should affect the algorithm in any way. Again, this rule is just meant to more robustly evaluate algorithms and not let variables with too many missing values drive the scoring.
Well, that's a misconception of the problem. The proteins/sites with more missing values are the lowly expressed ones, the difficult-to-predict ones, and also the biologically meaningful ones. Your cutoff, leaving ~1,300 sites, only tells you how well you do on housekeeping genes. The genes that actually drive the cancer process, with specificity to cancer pathways, are left out. My major is actually biology; machine learning only marginally ranks within my top 10 best subjects.
> Normally, we shouldn’t even reveal the scoring metrics in detail.
Please name one challenge whose scoring metric is not released.
> The goal is for participants to try what they think is the best based on their biological knowledge. It takes way more than a winner to make it into Nature Methods.
I agree. I think the biological setup is not ideal, as explained above, and I never understood why we are so obsessed with Nature XXX. They look like magazines and publish thousands of papers a year. I strongly recommend Bioinformatics, the number one journal in our field, edited by the most respected bioinformaticians today, and also free to publish in.
> The concept of a challenge is no longer as hot as it was 7 years ago.
That's something you should say in your internal meetings, not to all the participants; didn't your boss tell you?
Regardless. Just please go ahead and fix the scoring and the leaderboard.
Thank you so much!
>Tom implemented the 0% NA rule for both sc2 and sc3.
We think the implementation actually uses 100% of the data, regardless of NA.
So please do check that "missing rate <30%" is not mistakenly implemented as "missing rate >30%" or ">70%" this time... I don't know about you... I would be broken if I had to start all over again.
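(For what it's worth, a tiny sanity check along these lines would catch an inverted cutoff; the helper below is a sketch for illustration, not the challenge's actual scoring function.)

```python
# A quick sanity check to make sure the "<30% missing" rule is not accidentally
# inverted (keeping the high-missing rows instead of the low-missing ones).
# filter_by_missing_rate is a hypothetical stand-in for the scoring script's filter.
import numpy as np
import pandas as pd

def filter_by_missing_rate(df: pd.DataFrame, max_missing: float = 0.30) -> pd.DataFrame:
    return df.loc[df.isna().mean(axis=1) < max_missing]

toy = pd.DataFrame(
    {
        "s1": [1.0, np.nan, 2.0],
        "s2": [1.5, np.nan, np.nan],
        "s3": [0.5, 3.0, np.nan],
        "s4": [2.0, np.nan, 1.0],
    },
    index=["fully_observed", "mostly_missing", "half_missing"],
)

kept = filter_by_missing_rate(toy, max_missing=0.30)
assert "fully_observed" in kept.index      # 0/4 missing -> kept
assert "mostly_missing" not in kept.index  # 3/4 missing -> dropped
assert "half_missing" not in kept.index    # 2/4 missing -> dropped
```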
Dear both,
1) “Throughout all previous communications, we were told we would be evaluated on those with >5 examples in the test set in round 2.”
For round 2, there was a misunderstanding between us. I originally thought it was >5 examples for sc2 and 0% NA for sc3. Tom implemented the 0% NA rule for both sc2 and sc3.
2) “So the sub2 data is also going to change completely? Can I ask what the problem is here? Can I ask who, which team, asked you to do this?”
We provided all participants with data containing at least 5 measurements in both the training and test data, but then only proteins/phosphosites with a low missing rate (<30%) were meant to be scored in both sc2 and sc3. We apologize for the confusion and for not having updated the wiki properly. No team made this request and, as mentioned in the letter, this rule was made in order to compare different algorithms more robustly. In fact, the scoring metrics are affected by the number of observed samples per phosphosite/protein, and we want to avoid having phosphosites/proteins with high missing rates drive the ranking.
3) “My teammate has checked your previous code. It is neither what you described previously nor what you described in yesterday's email. What it calculates is a (pretty much random; please refer to the bug reported by another participant earlier) subset of the 20 samples, without any filtering!!!”
There was indeed a bug in the scoring; after debugging, it didn't make any difference in our dry run. But still, we should run the code again to be sure. As mentioned above, there was a misunderstanding among us: for both sc2 and sc3, the scoring function subset to proteins/phosphosites that were fully observed (0% NA) in the test set.
4) “Any scoring change will change the algorithm completely.”
The algorithm should not be designed to fit the scoring metric in any way. The focus of subchallenges sc2 and sc3 is to predict protein abundance and phosphorylation abundance. We do not see how reducing the subset of phosphosites/proteins should affect the algorithm in any way. Again, this rule is just meant to more robustly evaluate algorithms and not let variables with too many missing values drive the scoring. Normally, we shouldn’t even reveal the scoring metrics in detail. The goal is for participants to try what they think is the best based on their biological knowledge. It takes way more than a winner to make it into Nature Methods. The concept of a challenge is no longer as hot as it was 7 years ago.
5) “But can you please do us a favor and re-evaluate round 1 and round 2 using your final code?”
Yes, leaderboard results will be updated using the new scoring function.
Best,
Mi
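To make the distinction in point 1) concrete, here is a sketch of the count-based "at least 5 observations" rule next to the NA-fraction rules sketched earlier; the layout and function name are again assumptions of this illustration rather than the organizers' code. With n=20 test samples, ">=5 observations" tolerates up to 75% missing values per protein, while "0% NA" tolerates none and "<30% missing" tolerates at most 5 of 20, which is why switching between these rules changes the evaluated subset so much.

```python
# Sketch (not the organizers' code): the count-based rule "at least 5 observed
# test samples per protein/phosphosite", to contrast with the NA-fraction rules
# shown earlier in the thread.
import pandas as pd

def keep_min_observations(df: pd.DataFrame, min_obs: int = 5) -> pd.DataFrame:
    """Keep proteins/phosphosites with at least `min_obs` non-missing test samples."""
    n_observed = df.notna().sum(axis=1)  # count of non-NaN values per row
    return df.loc[n_observed >= min_obs]
```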
@saezrodriguez and @MI_YANG
My teammate has checked your previous code. It is neither what you described previously nor what you described in yesterday's email. What it calculates is a (pretty much random; please refer to the bug reported by another participant earlier) subset of the 20 samples, without any filtering!!!
That said, the new evaluation would be completely different from what is shown on the leaderboard, as it will be ~1,300 sites versus ~15,000 sites (and random sub-sampling), and the sampling of sites/proteins is completely biased. Please, I beg you, re-evaluate both rounds today with your final cutoff and debugged code! @saezrodriguez
People will be angry if you organize a challenge this way. Please re-evaluate sub2 and sub3 (and maybe also sub1) using the final metric. All prediction files are stored.
Tom can do it today. I beg you, please just email him and let him do it.
Dear organizers,
1. Is it really the case that the first two rounds in sub2 and sub3 were evaluated only on proteins without any missing values? From the reply to my post from a month ago, it seems that proteins with some missing values were accounted for:
https://www.synapse.org/#!Synapse:syn8228304/discussion/threadId=2697&replyId=13342
2. Additionally, after the leaderboard of round 2 was published, it turned out that there was a minor bug in the scoring script:
https://www.synapse.org/#!Synapse:syn8228304/discussion/threadId=2841
Have you verified that with the corrected code there are no changes in the leaderboard?
3. Can you confirm that the information about the change of the scoring metric in **sub2** is not a mistake? Three days ago we were told that the scoring metric for sub2 stays the same (proteins with at least 5 observations):
https://www.synapse.org/#!Synapse:syn8228304/discussion/threadId=2878&replyId=13979
Now it turns out that not only is the scoring metric changing 6 days before the final deadline, but also that the first two rounds were evaluated using a different metric than the participants thought (proteins with all observations vs. proteins with at least 5 observations). In such a situation, I also think it would help if the leaderboards for both rounds were recalculated, as suggested above.
Best,
Jan
Can the organizers clarify this email and re-evaluate round 2 and round 1 using whatever metric will be used in the end, please?