I will not reiterate the problem; please read the thread here: https://www.synapse.org/#!Synapse:syn6131484/discussion/threadId=604 and the confirmation of this bug several days later: https://twitter.com/anshulkundaje/status/761117086915497984?lang=en

It relates to these lines:

precision, recall, thresholds = precision_recall_curve(y_true, y_score, sample_weight=sample_weight)
return auc(recall, precision, reorder=False)

Can you please take a look? Thanks!
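As a minimal sketch of the issue (assuming the scorer really uses the precision_recall_curve + trapezoidal auc combination quoted above; the data below are synthetic, purely for illustration), tied scores get linearly interpolated in PR space and the area comes out far above the positive-class prevalence:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

rng = np.random.RandomState(0)
y_true = (rng.rand(1000) < 0.1).astype(int)   # ~10% positives
y_score = np.zeros_like(y_true, dtype=float)  # every example gets the same (tied) score

precision, recall, _ = precision_recall_curve(y_true, y_score)
print(auc(recall, precision))                    # ~0.55: the single tie is interpolated linearly
print(average_precision_score(y_true, y_score))  # ~0.10: matches the positive-class prevalence
```

average_precision_score avoids the interpolation, which is why a constant predictor scores roughly at prevalence there but well above it with the trapezoidal version.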

Created by Yuanfang Guan (yuanfang.guan)
Hi Yuanfang: We are still investigating. As Phil mentioned, his updates only affect the cross-validation within the training data, not your initial point about the AUPR.
I read your new code, so there is no plan to replace the built-in AUPRC? Then one doesn't need to do much training: just make some ties at good places and the result can be super high.
Good points. I've modified the scoring code to make it slightly more robust, and while my modifications don't obviate the first two issues once and for all (they only ameliorate a subset of issue 1), we made certain to split the training and test sets in such a way that each has good enough class representation that the above issues don't break anything in the actual scoring. Your 2nd point is an ongoing issue with sklearn (https://github.com/scikit-learn/scikit-learn/issues/8100). Looking at the source code, class labels are sorted ascending in sklearn.preprocessing.label_binarize as well as in OneVsRestClassifier and VotingClassifier (if you follow the rabbit hole of assignments, you always end up at self.classes_ = np.unique(classes), which always returns a sorted np.ndarray).
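Here is a quick sanity-check sketch of that sorting claim (the out-of-order labels and single-feature X are made up purely for illustration): both label_binarize and OneVsRestClassifier end up with the same ascending class order produced by np.unique, so their columns line up.

```python
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# class labels deliberately listed out of order
y = np.array([3, 1, 4, 1, 3, 4, 2, 2, 0, 0])
X = np.arange(len(y), dtype=float).reshape(-1, 1)

classes = np.unique(y)                  # [0 1 2 3 4], always sorted ascending
Y = label_binarize(y, classes=classes)  # column i of Y corresponds to classes[i]

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(classes)       # [0 1 2 3 4]
print(ovr.classes_)  # [0 1 2 3 4] -- same order, so predict_proba columns line up with Y
```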
Actually I do find a serious potential bug in your code, which cannot be detected by using the training set as both train and test. Can you take a look? It happened in 3 of my 5 cross-validation splits in Sub 1. Because the sample size is small, it can happen that the training set contains different categories from the test set. This can result in the following problems:

1. When the training-set categories are a subset of the test-set categories (or the reverse): e.g. training set 0,1,2, test set 0,1,2,3,4. An error is reported on the line aupr = pr_auc_score(y_true, y_score, average=average), because the predicted score will only have 3 columns while the test score will have 5 columns.

2. **More dangerous, undetectable situation:** when the training-set categories differ from the test-set categories but have the same count: e.g. training set 0,1,2,3, test set 0,2,3,4. The shapes are the same, so it goes ahead and compares class 1 against class 2, class 2 against class 3, and class 3 against class 4.

3. I am not sure **whether the labels are actually sorted in the label-binarize function, and whether they are sorted in exactly the same order as in the function called in one-versus-all**. From a quick reading of their code, I don't think it is guaranteed. Then even if both sets are 0,1,2,3,4, the result could be wrong, because one could be ordered 0,1,3,2,4 and the other 0,1,2,3,4. If the sorting is not consistent but depends on the order in which examples appear, binary classification and evaluation are affected as well.

Below I mark down the related lines that cause the bug:

47 if n_classes > 2:
48     y_true = label_binarize(y, y_labels)  # y_true has to come from the test-set y for the pr_auc_score calculation below; its shape is [# test examples, # categories in the test set]
49 else:
50     y_true = y
51 y_score = clf.predict_proba(X) if n_classes > 2 else clf.predict_proba(X).T[1]  # clf is the model trained on the training set, so the predicted score has shape [# test examples, # categories in the training set]
52
53 print(y_true.shape, y_score.shape)
54 aupr = pr_auc_score(y_true, y_score, average=average)

I prefer not to provide a fix, because I don't think this evaluation and training setup is appropriate. For example, when we evaluate whether 3 is correct, 1, 2, 4 and 5 are all used as negatives. I think a correlation or rank correlation is more appropriate, followed by a set of regression functions in an ensemble, because it makes no sense to train a score of 3 against 1, 2, 4 and 5 (median severity trained against no-symptom and severe at the same time clearly doesn't make sense).
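Here is a minimal sketch of situation 2 above (the features and fold compositions are made up for illustration): train on classes 0,1,2,3 and evaluate on classes 0,2,3,4. The shapes agree, so nothing raises, but each column of y_score is silently compared against a different class's column in y_true.

```python
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
y_train = np.repeat([0, 1, 2, 3], 10)   # training fold happens to contain classes 0,1,2,3
y_test = np.repeat([0, 2, 3, 4], 5)     # test fold happens to contain classes 0,2,3,4
X_train = rng.randn(len(y_train), 3)
X_test = rng.randn(len(y_test), 3)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = clf.predict_proba(X_test)                         # columns are classes 0, 1, 2, 3
y_true = label_binarize(y_test, classes=np.unique(y_test))  # columns are classes 0, 2, 3, 4

print(y_true.shape, y_score.shape)       # both (20, 4): no shape mismatch, so no error is raised
print(clf.classes_, np.unique(y_test))   # [0 1 2 3] vs [0 2 3 4]: column i refers to different classes
# Passing these on to the scorer would compare class 1's probabilities against class 2's
# labels, class 2's against class 3's, and class 3's against class 4's.
```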
@phil, you may want to check with Abhi or Tom regarding AUPR code in Python that interpolates in the case of ties.
Yes, that is irrelevant to you; it's sklearn. cc Solly, she knows this problem of tied examples.
I haven't yet had time to go over all the threads linked to above, but I can talk about how AUPRC is implemented in the example scoring code released today (though not yet documented in the wiki). (Code here: syn11036920.)

* I modeled my implementation after the already existing implementation of AUROC in sklearn (roughly the pattern sketched below), so if there is something wrong with sklearn.metrics.precision_recall_curve or sklearn.metrics.auc, then there could be something wrong with the scoring code as it is currently implemented. We'll be looking into this and making sure everything is functioning properly before subchallenge 2 goes live.
* If you seem to be getting strange scores from the code released today, that doesn't necessarily mean something is wrong with the scoring. Keep in mind that the code is not only training on the training set but also calculating the AUPRC on the training set. I'm able to achieve an AUPRC of 1.0 by replacing the NAs in the submission template (syn11026476) with two constant features and one random feature. You shouldn't expect that performance to extrapolate to the private test set.
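For readers who haven't opened syn11036920, here is a rough reconstruction of what such a pr_auc_score might look like, assuming it simply mirrors the precision_recall_curve + auc pattern quoted at the top of the thread (this is my own sketch, not the actual scoring code):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def pr_auc_score(y_true, y_score, average="macro", sample_weight=None):
    """PR AUC modeled on the roc_auc_score pattern: one curve per class, then average."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    if y_true.ndim == 1:  # binary case: y_score holds positive-class probabilities
        precision, recall, _ = precision_recall_curve(y_true, y_score, sample_weight=sample_weight)
        return auc(recall, precision)
    # multilabel-indicator case: assumes column i of y_true and y_score describe the same class
    per_class = []
    for i in range(y_true.shape[1]):
        precision, recall, _ = precision_recall_curve(y_true[:, i], y_score[:, i], sample_weight=sample_weight)
        per_class.append(auc(recall, precision))
    return float(np.mean(per_class)) if average == "macro" else np.array(per_class)
```

Note that a scorer written this way inherits both problems discussed above: the trapezoidal interpolation over tied scores and the assumption that the columns of y_true and y_score refer to the same classes in the same order.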
Thanks Yuanfang: We will review and make sure the scoring code is correct. Best, Larsson

Sub challenge 2 evaluation code may be bugged