I found that the scoring code only produces a model. How is this model then used to make predictions? I am confused because, from the sample submission file, it looks like we submit one file, and according to the code that whole file goes to training. So how is the test set called?
Created by Yuanfang Guan (yuanfang.guan)

@vaibhavtripathi - As [described](https://www.synapse.org/#!Synapse:syn8717496/wiki/466771), you are required to submit features for *all* recordIds, even though we will only use some of them for primary scoring. The remainder may be used as tie breakers and will be used in the Community Phase analyses. A template has been provided, so please check against it to make sure you include all samples.

Actually, I just found out the cause of the problem:
905 1 1 1 1 1 1 1 1 1 1 1
I have a set of 905 examples that all have a value of 1 for the 10 wavelengths I identified, so they are all tied values. So essentially, just to get the data to run through knn, we would have to significantly reduce performance.
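To illustrate the mechanism (this is my own sketch, not part of the scoring code): identical feature rows sit at distance zero from each other, so every neighbour is tied, which is what trips the `MAX_TIES` limit inside caret's `knn3Train()`.

```r
# Sketch only: 905 identical feature vectors, like the ones above
x <- matrix(1, nrow = 905, ncol = 10)

# every pairwise distance among identical rows is exactly 0 ...
all(dist(x[1:50, ]) == 0)   # TRUE

# ... so every duplicate is an equally-near neighbour; once the number of
# tied neighbours exceeds caret's MAX_TIES limit of 1000, knn3Train()
# aborts with "too many ties in knn"
sum(duplicated(x))          # 904 duplicates of the first row
```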
The fix for this bug in caret:
https://github.com/topepo/caret/blob/master/pkg/caret/src/caret.c
On line 14, change `#define MAX_TIES 1000` (which is what triggers the error on lines 58-62) to `#define MAX_TIES 5000`.
As I remember, somewhere in the knn.c that it calls there is another definition of `MAX_TIES 1000`; I don't recall exactly where off the top of my head, but that line needs to be fixed as well.
But I don't know whether you should go ahead and fix every problem with this package; there are probably more than 50 of them to fix before all submissions are guaranteed to run through. I am not criticizing caret; to be fair, nothing that comes out of R is entirely reliable.
Please advise how to proceed. Alternatively, you could just take the maximal value across all base-learners (so that if one of the packages fails, it is fine), or just take the final prediction.

As a participant, I would suggest adding noise to the "same" values ... similar to randomness, but it keeps the imputation protocol in place... The scale of the noise will be important.
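For what it's worth, here is a minimal sketch of that noise idea (my own illustration; `features` is a placeholder data frame, and the scale factor is something each team would have to tune):

```r
# My own sketch (not part of the scoring code): break the exact ties created
# by constant imputation by adding tiny noise, scaled to each column's spread.
add_tie_breaking_noise <- function(x, rel_scale = 1e-6) {
  spread <- diff(range(x, na.rm = TRUE))
  if (!is.finite(spread) || spread == 0) spread <- 1
  x + rnorm(length(x), sd = rel_scale * spread)
}

# `features` is a placeholder for the extracted-feature data frame
features[] <- lapply(features, function(col) {
  if (is.numeric(col)) add_tie_breaking_noise(col) else col
})
```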
Non-finite variance and multicollinearity are still concerns for such models. Random Forest is generally the most robust against non-finite variance and multicollinearity.

@sieberts For the submission, do we have to provide all the recordIds in the 3 datasets, or only those limited to the 2 medication statuses (for the training and supplemental data) as pointed out above?
Actually, in addition to the SVM problem, I just found that knn can also fail under some unexplained conditions (I am just running 5-fold CV multiple times; it fails in 3 of my 10 runs in total, once with svm, once with knn and once with nnet, but succeeds in the other 7). This happens when I fill in the missing values with the same constant, and it then gives a completely random prediction. I assume this will happen to a significant fraction of participants who fill in missing values with the mean/median or zero.
Error in { : task 1 failed - "too many ties in knn"
Calls: caretList ... train.default -> nominalTrainWorkflow -> %op% ->
In addition: Warning messages:
1: predictions failed for Resample26: k=5 Error in knn3Train(train = structure(c(-1.34908599792031, -1.34908599792031, :
too many ties in knn
2: predictions failed for Resample26: k=7 Error in knn3Train(train = structure(c(-1.34908599792031, -1.34908599792031, :
too many ties in knn
3: predictions failed for Resample26: k=9 Error in knn3Train(train = structure(c(-1.34908599792031, -1.34908599792031, :
too many ties in knn
Execution halted
(4875, 11)
I found that the problem goes away when I insert random values for the missing values. But I think that puts too many constraints on how we do things, just to work around the intrinsic bugs in these R packages, instead of focusing on developing the right method.
That is exactly why I asked for the scoring code to be published much earlier. If it had been published two months ago, we would have been able to identify and address these problems well in advance.
Though it may be a bit late: will the challenge accept a direct prediction? In general, I found that feeding my features directly into a random forest scores about 0.05 higher than the ensemble code being provided, and it never fails.

Just to clarify, the two medication statuses we keep are "I don't take Parkinson medications" and "Immediately before Parkinson medication". We hope you'll still put effort into extracting features that will be useful for the other medication statuses, though, since this will be one of the analyses we consider in the Community Analysis Phase.

Hi Nao:
Not exactly. We know that a PD participant's performance varies relative to when they took their medication, sometimes to the point where their performance is close to that of controls. Also, only individuals with `professional-diagnosis=True` will have a medication status. To diminish the effect of medication, we will only use the records where PD patients are off medication vs. controls, in both training and test.

Hi organizers,
So, I really need clarification on this, because I am seeing a very concerning and contradictory statement here.
The challenge question says the Sub-challenge 1 task is:
**"Extract features from the mPower Data walking and balance tests which can be used to distinguish Parkinson's patients from controls (professional-diagnosis = TRUE/FALSE)"**
But you mentioned that:
**"As for medication, for the purposes of scoring during this phase, we filter Parkinson's patients by medication status and only keep instances where they are off medication. We will use the "on medication" timepoints for exploratory analysis in the Community Phase". ** .
This means that the predictions or features should not be generated based on (professional-diagnosis = TRUE/FALSE), but rather based on (professional-diagnosis = TRUE && off-medication) vs. (professional-diagnosis = FALSE). **Is this correct?**
And the definition of off-medication covers "Another time", "I don't take Parkinson medications", and "Immediately before Parkinson medication". **Is this correct?**
I feel like this is very important and it can affect the quality of features significantly.
Actually, it is a subset of the training data, since I do cross-validation. But I will send you the features in a second, along with two scripts that both take in this file (modified slightly from your file in the read-in part so it runs locally), one with SVM and one without. You will see that the one with SVM dies, while the other is just fine.

@MonteShaffer, submissions will be ranked based on AUROC for predicting PD diagnosis.
You may submit as many times as you like prior to the deadline, but only your final 2 submissions will be scored and posted publicly. There is no leaderboard. Scores will be posted after the deadline, once we have validated the submissions. I suggest running the scoring code provided to see how your features are doing on the training set, in order to optimize your solution prior to submission.
Submission deadline is 5pm PDT/8pm EDT on October 1st. This info was on the submission instructions page, but I have added it to the dates page for clarity.
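For anyone who wants to sanity-check AUROC on the training set locally, something like the following works (a sketch only; `labels` and `pred_prob` are placeholders for the true diagnoses and your model's predicted probabilities, not objects from the scoring code):

```r
library(pROC)

# labels: the professional-diagnosis labels (TRUE/FALSE or 1/0) for the
# training records; pred_prob: your model's predicted probability of PD
roc_obj <- roc(response = labels, predictor = pred_prob)
auc(roc_obj)
```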
@yuanfang.guan, the current file contains all of the training data being used. Would you please email me the set of predictors you're using that causes SVM to fail, so I can verify? Thank you!

Actually, I still have difficulty running the scoring code, with the following error:
line search fails -0.9604691 -0.3990996 1.16161e-05 -5.520403e-07 -1.256131e-08 -3.237231e-09 -1.441263e-13
line search fails -0.9753169 -0.3737174 1.41475e-05 4.180014e-08 -1.565572e-08 -4.434284e-09 -2.216747e-13
Warning messages:
I found out that the root cause is an intrinsic bug in the SVM library (kernlab) called by caret:
https://github.com/cran/kernlab/blob/master/R/ksvm.R#L3092
See lines 3090-3094. Essentially, if there are one or several highly accurate features within the feature set, the program fails.
There are several ways to fix this bug:
1. Change this line in the R SVM library source code, https://github.com/cran/kernlab/blob/master/R/ksvm.R#L3009, from `minstep <- 1e-10` to `minstep <- 1e-15`.
This would fix my issue, but someone with even more accurate features could still fail this training step.
2. I found that when I remove the top 2-3 most accurate features, the problem is gone. As a result, I think this error will appear for essentially all meaningful and reasonable submissions from teams.
3. Remove the line `svmLinear = caretModelSpec(method="svmLinear")`.

My current features run OK when SVM is gone (or fixed), but **I wonder if the scoring code could be changed so that a base-learner is dropped out if it happens to fail** (a rough sketch of what I mean follows below). The reason I suggest this is that, even with the SVM bug fixed, I found that nnet can still fail for certain scales/numbers of features.
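To make that dropout idea concrete, here is a rough sketch of what I mean (my own illustration, not a patch to the actual scoring code; `train_df` and the `diagnosis` outcome column are placeholders):

```r
library(caret)

# Fit each base-learner independently and drop the ones that error out,
# instead of letting a single failure (svmLinear, knn, nnet, ...) kill the run.
base_methods <- c("rf", "knn", "nnet", "svmLinear")

fits <- lapply(base_methods, function(m) {
  tryCatch(
    train(diagnosis ~ ., data = train_df, method = m,
          trControl = trainControl(method = "cv", number = 5, classProbs = TRUE)),
    error = function(e) {
      message("Dropping base-learner '", m, "': ", conditionMessage(e))
      NULL
    }
  )
})

fits <- Filter(Negate(is.null), fits)   # keep only the learners that trained
```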
Thank you very much for letting me know you are not the organizer of the challenge.
Yuanfang Guan

Hi Yuan,
My answers are just my own. I am not the organizer of the challenge.

Got you, sorry. Just to double-confirm: the current example file is just a demo (i.e. non-complete), right?

We are trying to predict a binary choice (0,1) for PD condition; however, `this` would be a more appropriate prediction. Can I distinguish an "on" PD patient from a healthy non-PD participant? Can we predict (0, 1, 2, 5)? About the medication levels, I have internally coded the following:
```
overall_health_as_medtime:
0=not on meds [off]
1=somewhere in between
2=on meds [on]
3=unknown
5=healthy, no medication
```
If the data is missing (since 86% of the data comes from healthy individuals), I will score NA as a 3. The confusion stems from the lack of clarity:
* AUROC as the scoring algorithm. Other challenges clearly define the optimal response. See https://www.kaggle.com/c/passenger-screening-algorithm-challenge#evaluation
In this scenario, we are predicting a binary choice (0 or 1) for PD condition. This can easily be turned into a number.
`argmin` or `argmax` ...
* How does scoring work? Is there a public/private leaderboard? If we prepare a CSV and submit it, will we see a score as feedback so we can tweak results before the final rankings? The final rankings are due in a week.
* When is the deadline? October 1 at 11:59PM? This is not clear.
I recognize that the goal is to find meaningful features from the data. Getting the best score is generally a function of having the best observable/repeatable measures (features). See https://www.kaggle.com/wiki/WinningModelDocumentationTemplate
So my expectation, as a data scientist trying to support PD research, is this: your sample submission.R file is an example of how I can connect to the synapseClient, upload my CSV, and have the virtual machine analyze my data and return a score, where the goal is to minimize error or maximize variance explained. Please clarify AUROC.
Yuanfang, if you examine the demographic data being merged in, it contains only the recordIds being used for scoring. We have not included the actual filtering step in the code. I hope that clarifies the situation. You can assume similar filtering has been applied to the test data.

I am sorry, but:
> We only retain records which are before medication or for those who do not take medication.
This is the training set:
74
11307 "Another time"
12122 "I don't take Parkinson medications"
5815 "Immediately before Parkinson medication"
5313 "Just after Parkinson medication (at your best)"
1. You only retain the 12122 and the 5815 lines, and throw out both "Another time" and "Just after Parkinson medication (at your best)"??? That is throwing out 3/4 of the positive examples.
I don't know if this is the final decision or still under discussion, but it at least doesn't seem reasonable.
2. In your code:
groupvariables<-c("healthCode", "medTimepoint", "professional.diagnosis", covs_num, covs_fac) #"age", "gender") #, "appVersion.walktest", "phoneInfo.walktest")
and then this is the line where they got merged:
mediantraining<-dttrain[, lapply(.SD, median, na.rm=TRUE), by=groupvariables, .SDcols=featurenames ]
mediantraining<-as.data.frame(mediantraining)
All I can see is that the data is grouped by medication timepoint, but I see nowhere that some groups are thrown out.

Yuanfang, this is done when the demographic data are merged with the features. We only retain records which are before medication or from those who do not take medication.

Actually, in your scoring code, what is being done (for both training and testing) is that an individual before medication and after medication is used as two separate examples. So you mean there is separate code that filters out the ones "on medication" (in either training and/or test)?
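For concreteness, here is a hypothetical sketch of the kind of filtering described above; this is my own illustration, not the released scoring code, and it assumes the merged `dttrain` table already carries `professional.diagnosis` and `medTimepoint` columns:

```r
library(data.table)

# Hypothetical illustration only -- the actual filtering step was not released.
off_med <- c("I don't take Parkinson medications",
             "Immediately before Parkinson medication")

# keep controls, plus PD records collected while off medication
dttrain <- dttrain[professional.diagnosis == FALSE |
                     (professional.diagnosis == TRUE & medTimepoint %in% off_med)]
```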
Sorry, Yuanfang. I'm not sure what you mean by age and gender being bounded to the examples. Please elaborate.
As for medication, for the purposes of scoring during this phase, we filter Parkinson's patients by medication status and only keep instances where they are off medication. We will use the "on medication" timepoints for exploratory analysis in the Community Phase.

I see. So only age and gender are bound to the examples?
As I read this code, it looks like an individual before medication and the same individual after medication are treated as two different examples in the evaluation. Does an individual after medication become 0 in the gold-standard label?

Hi Yuanfang:
Even though a single file is submitted, the model is trained only on the rows that are in the [training samples](syn10146553). The score is then computed based on the predictions for the testing samples.
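In other words, the single submitted file is split by recordId before training and scoring. A sketch of the idea (the file name and the `train_ids` / `test_ids` vectors are placeholders, not the actual objects used in the scoring code):

```r
# `my_features.csv` and the id vectors are placeholders for illustration only
submission <- read.csv("my_features.csv", stringsAsFactors = FALSE)

# train_ids: recordIds listed in the training samples (syn10146553)
# test_ids:  recordIds held out for scoring
train_features <- submission[submission$recordId %in% train_ids, ]
test_features  <- submission[submission$recordId %in% test_ids, ]

# the model is fit on train_features; the reported score is computed from its
# predictions on test_features
```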