The scoring code for the L-Dopa challenge has been updated, and an abridged version that only requires the training data is here: syn11254333
Previously we were calculating AUPRC by taking the integral underneath the precision-recall curve. In the multi-class case (tremor), we performed micro-averaging before taking the integral.
The new implementation computes a non-linear interpolation of the precision values as detailed on page 34 of [this paper](https://www.ncbi.nlm.nih.gov/pubmed/19348640). An integral is then taken under the new precision-recall curve. In the multi-class case, we take the weighted mean of the AUPRCs for each class. In the binary case, no weighting is done, and the same scoring as in the [Respiratory Viral DREAM Challenge](syn5647810) is used.
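As a rough illustration of how these pieces fit together, the sketch below computes one-vs-rest AUPRCs per class and combines them as a weighted mean. It uses scikit-learn's `average_precision_score`, which applies a step-wise rule rather than the interpolation from the paper, so its numbers will only approximate the official scorer; `y_score[:, i]` assumes one prediction column per category.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def multiclass_auprc(y_true, y_score, categories, weights):
    """One-vs-rest AUPRC per class, combined as a weighted mean.

    Approximation only: uses scikit-learn's step-wise average precision,
    not the non-linear interpolation from the cited paper.
    """
    y_true = np.asarray(y_true)
    auprcs = [
        average_precision_score((y_true == cat).astype(int), y_score[:, i])
        for i, cat in enumerate(categories)
    ]
    return np.average(auprcs, weights=weights)

def binary_auprc(y_true, y_score):
    """Binary case: plain AUPRC of the positive class, no weighting."""
    return average_precision_score(y_true, y_score)
```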
Oh, yes! That's correct. Weighting is done only in the multi-class case.

Thank you very much, Phil, for the clarification. Sorry for the misleading information above.
Just to make sure my reading is correct: only [896, 381, 214, 9, 0] is used to weight the categories, and the binary phenotypes do not use weighting, right?
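If that reading is correct, the combination amounts to a weighted mean like the one below; the per-class AUPRC values here are made up purely for illustration.

```python
import numpy as np

# Made-up per-class AUPRCs, only to show how the weights combine them.
auprcs  = [0.60, 0.40, 0.30, 0.20, 0.10]
weights = [896, 381, 214, 9, 0]   # class 4 has weight 0 and contributes nothing

print(np.average(auprcs, weights=weights))  # (896*0.60 + 381*0.40 + ...) / 1500
```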
Thanks for pointing out those issues @yuanfang.guan! I've updated the code to reflect your suggested changes.
I wasn't using the CATEGORY constant at all in the previous version, so it has been removed. I've also seeded the code so that we can have reproducible results between runs.
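For anyone running the code locally, the seeding typically looks something like this; the regressor and seed value below are illustrative stand-ins rather than what the scoring code necessarily uses.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

SEED = 0  # hypothetical value; any fixed integer makes runs reproducible

np.random.seed(SEED)                              # numpy-level randomness (e.g. shuffles)
model = RandomForestRegressor(random_state=SEED)  # the regressor's own randomness
```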
Actually, I am quite confused by this code. I had to modify it a bit to make it suitable for cross-validation, but I am certain I did not introduce anything wrong into the code; I just translated it to a single phenotype that loads train.txt and test.txt, so it is easy for me to test.
Then, just as a sanity check, when I change

    #CATEGORIES=[0,1,2,3,4]
    #CATEGORY_WEIGHTS=[896, 381, 214, 9, 0]

to

    CATEGORIES=[0,1,2]
    CATEGORY_WEIGHTS=[896, 381, 223]
it doesn't even report an error; instead, with the first two lines it reports a score of 0.95, and with the second two it reports a score of 0.16. I am sure both are wrong, as I have independent evaluation code inherited from other challenges, with the interpolation implemented in R and the same weighting assigned in the code. The test and train categories are 5 in total, which is actually the simplest case.
I don't know how to interpret this behavior or how to trust these evaluation results when such a mistake happens: there isn't even a warning or an error report; it just goes ahead and produces a score.
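One way to turn this into a hard error rather than a silent score would be an explicit consistency check before scoring; the snippet below is a hypothetical addition, not part of the released code.

```python
import numpy as np

CATEGORIES = [0, 1, 2, 3, 4]
CATEGORY_WEIGHTS = [896, 381, 214, 9, 0]

def check_categories(y_true, categories=CATEGORIES, weights=CATEGORY_WEIGHTS):
    """Raise instead of silently scoring when the constants and the data disagree."""
    if len(categories) != len(weights):
        raise ValueError("CATEGORIES and CATEGORY_WEIGHTS have different lengths")
    unexpected = set(np.unique(y_true)) - set(categories)
    if unexpected:
        raise ValueError(f"labels {sorted(unexpected)} appear in the data "
                         f"but not in CATEGORIES={categories}")
```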
Plus, I ran it twice within two minutes, without any changes, and the scores were different.
###
Updates:
1. I think that running it twice and getting different results is just because no seed was set in the regressors. I think you may want to fix this; the result can differ by 0.02 from run to run.
2. I figured out a bug on my side: I used the training set as the test set, so it doesn't produce 0.95 anymore. But that still doesn't explain why, when I truncate the categories, the score changes yet a score is still produced; that part is still worrisome.
3. With that bug fixed, when both use five categories, it is consistent with an independent implementation, with only a 0.01-0.04 difference. I think that is acceptable since, as you said, the final test gold standard also has five categories. Further, I see a significant drop of 0.05-0.1 compared to the previous scoring code, which doesn't account for ties, which is a great sign.
Actually, I am quite confused about the new code: if it is just two categories, it is just the AUPRC for the positives, right? So no weighting is actually used? Just to make sure: the beginning lines are actually redundant?

Thanks for the clarification, Phil! This is from a discussion in a separate project:
> Keep in mind that the scoring code is both trained and scored on the training set. I'm able to get AUPRCs of > 0.99 with the released scoring code simply by training on a feature
> that samples from the (0, 1) interval, but I get ordinary results when using the actual scoring code (which scores on the test set after training on the training set).
>
> The publicly released code should be used to understand how we are building and evaluating models on the submitted feature sets, but the score shouldn't be taken as an indication
> of generalization performance.
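The effect described in the quote is easy to reproduce with a toy setup; the classifier, feature, and data sizes below are illustrative and not the challenge's actual pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 1))   # a feature sampled from (0, 1)
y = rng.integers(0, 2, size=1000)       # labels unrelated to the feature

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Scored on the data it was trained on, the useless feature looks great;
# scored on held-out data, it looks like what it is: noise.
print("train AUPRC:", average_precision_score(y_tr, model.predict_proba(X_tr)[:, 1]))
print("test  AUPRC:", average_precision_score(y_te, model.predict_proba(X_te)[:, 1]))
```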