Running a simple baseline of features throws a too-big allocation error.
```
Error: cannot allocate vector of size 8.9 Gb
TRACEBACK:
4.
array(r, dim = d, dimnames = if (!(is.null(n1 <- names(x[[1L]])) &
is.null(n2 <- names(x)))) list(n1, n2))
3.
simplify2array(answer, higher = (simplify == "array"))
2.
sapply(training_features[, featurenames], as.numeric)
1.
PD_score_challenge1(submitme)
```
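For context, the 8.9 Gb comes from `sapply()` trying to build one dense numeric matrix out of all the feature columns at once (rows x columns x 8 bytes per double). A rough sanity check you can run on a submission before scoring; `submitme` follows the submission format used later in this thread (first column `recordId`, remaining columns features), and the arithmetic below is just the standard bytes-per-double estimate, not anything from the challenge code:
```
# Estimate the size of the numeric matrix the scoring code would build:
# rows x feature columns x 8 bytes per double.
n_rows    <- nrow(submitme)
n_feats   <- ncol(submitme) - 1L          # everything except recordId
approx_gb <- n_rows * n_feats * 8 / 1024^3
cat(sprintf("approx. %.2f Gb for a %d x %d double matrix\n",
            approx_gb, n_rows, n_feats))

# Also confirm the feature columns are already numeric, so as.numeric() inside
# the scoring code converts nothing unexpectedly (e.g. factors or characters).
str(submitme)
```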
Can we get documentation on what the result from the provided function represents?
* What is the goal?
* Are we trying to `argmax(resultme$error$ROC)`, i.e. the closer to 1 the better?
```
resultme = PD_score_challenge1(submitme);
str(resultme$error)
'data.frame': 1 obs. of 7 variables:
$ parameter: Factor w/ 1 level "none": 1
$ ROC : num 0.917
$ Sens : num 0.988
$ Spec : num 0.826
$ ROCSD : num 0.005
$ SensSD : num 0.00386
$ SpecSD : num 0.00726
```
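If the answer is simply "higher `ROC` is better", a minimal way to compare two scoring runs while keeping the cross-validation noise in view; the `resultme_a` / `resultme_b` objects below are hypothetical outputs of `PD_score_challenge1()` for two different feature submissions:
```
# Compare the cross-validated AUROC of two scoring runs; differences much
# smaller than the reported CV standard deviation are probably noise.
delta <- resultme_a$error$ROC - resultme_b$error$ROC
noise <- max(resultme_a$error$ROCSD, resultme_b$error$ROCSD)
cat(sprintf("delta ROC = %+.4f (CV SD ~ %.4f)\n", delta, noise))
```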
Created by Monte Shaffer (@MonteShaffer)
It depends on how you split your data during cross-validation. Figures 2 and 5 in [Chaibub Neto et al.](https://arxiv.org/abs/1706.09574) illustrate the two different approaches that Saeb et al. are talking about.
Actually, I am reading them this morning, but then I would assume the pooled values should be even higher? I don't really get the point of Saeb's paper. When I evaluate by record and by individual, the performance matches within +/- 0.02 of the number I posted above. @yuanfang.guan
That paper was using every recordId as an independent measurement instead of summarizing multiple measurements per participant into a single value. Even though this is common practice in mobile health studies, it can, in hindsight, lead to overfitting. See, for example, [Chaibub Neto et al.](https://arxiv.org/abs/1706.09574) and [Saeb et al.](https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/gix019/3071704/The-need-to-approximate-the-use-case-in-clinical). @larssono
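For what it's worth, the subject-wise split described above can be expressed directly in caret with `groupKFold`, so every record from a given participant stays on one side of each fold. A minimal sketch; `records` (a per-record feature table with a `healthCode` column and a two-level `status` factor) is an assumption, not challenge code:
```
library(caret)

# Build folds that never split one participant's records across train/test,
# i.e. the "by individual" evaluation rather than the "by record" one.
set.seed(1)
folds <- groupKFold(records$healthCode, k = 5)

ctrl <- trainControl(method = "cv", index = folds,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

fit <- train(status ~ .,
             data = records[, setdiff(names(records), "healthCode")],
             method = "glm", metric = "ROC", trControl = ctrl)
fit$results$ROC   # subject-wise cross-validated AUROC
```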
Can you tell me whether the 0.88 reported in the publication 'mPower: A smartphone-based study of Parkinson's disease provides...' is built from the baseline features? When I use them, I only get AUCs of 0.606890112, 0.674930051, and 0.54895601 respectively for the three tasks (on the global population; on the age-controlled population it is lower).
@MonteShaffer, I can add some additional information about the age matching - we chose to limit *both* the training and testing to individuals older than 57. This leads to a balanced population with a similar age distribution between PD and control participants. When this is done, most of the variance will not be explained by age. In fact, for most models that we have built in prior work, age is not a significant contributor to the prediction of PD status in age-matched populations.
As for the importance of understanding variation within PD patients - I couldn't agree more. We are hoping to cover this in the collaborative phase, where we will share the medication timing on the additional testing samples. Absolutely agree, Monte. Given that the average age of diagnosis for PD is in the early 60s, it seems that this is the important timeframe to target.
Thanks for the ideas, Monte! Those are certainly things we should explore during the community phase. We would love to try to leverage the younger samples in some way, especially the younger PD patients.
Solly
Thanks Solly,
I believe Yuan and I are trying to help; this is why we are voicing our concerns.
Separating aging degeneration from PD degeneration is important, and challenging. But I believe the walking motion data is significantly different between a 60-year-old with PD and one without.
I believe I have developed some cadence/rigidity features that will be meaningful. They will have a slight age variation, but not much, based on my internal correlation analysis.
I am looking for clarity. If "age" is a covariate in the scoring box and eats up most of the AUC (0.9), how are we supposed to find anything meaningful? I haven't thought too long about this, but I would suggest a residual approach where you remove all age-effect correlations (e.g., with a basic logistic regression) and have us predict on the residuals using this ensemble approach.
Thank you, Monte. The self-selected, self-reported nature of the data also presents a huge challenge. We have done our best to design the challenge in a way that minimizes artifacts and gives participants an opportunity to be successful (i.e., it is doable). As you pointed out yesterday, Yuanfang, if age alone results in an AUC of 0.9, how can we expect to see any meaningful contribution from the walking features? And scientifically, we're not interested in age effects, we're interested in PD effects.
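To make the residual idea suggested above concrete, here is a minimal sketch. It assumes a per-participant table `demo` with a 0/1 `isPD` column and an `age` column, plus a data frame `feat` of walking features; none of these names come from the challenge code, and this is only one of several ways to residualize out age:
```
# Fit the age-only logistic model, then see how much the walking features can
# explain of what age leaves unexplained.
age_fit   <- glm(isPD ~ age, data = demo, family = binomial)
age_prob  <- predict(age_fit, type = "response")   # P(PD | age) alone
age_resid <- demo$isPD - age_prob                  # the part age cannot explain

# Any variance the features capture here is signal beyond the age effect.
resid_fit <- lm(age_resid ~ ., data = feat)
summary(resid_fit)$r.squared
```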
That being said, this data set is actually very rich, and we hope to use it to explore other dimensions of the data (e.g. medication effects) during the community phase. We hope you'll all participate and bring your great ideas for questions we can answer!
As for Stephen Friend's position at Apple, it is true that he remains on our board of directors. However, he is not involved in the day-to-day operations at Sage, and, as always, we and DREAM maintain our commitment to Open Science. We don't have any representatives from Apple on our organizing team. We do have representation from Verily but, again, since we maintain an open-science framework in this challenge, no parties will have access to information that isn't freely available. I hope that allays your concerns.
re Monte:
1. > "I am participating because this will benefit PD and MJFF in the long run."
That's exactly why I am participating, and I am upset when someone takes such valuable data not towards the best solution, the good of science, or all the PD patients enrolled in the study, but as an opportunity for personal revenge. If that is not the case, that is great.
2. I don't know much machine learning, so I cannot comment.
3. What I was saying is that the evaluation code could produce a reversed result because of the label of the first example, as has been discussed numerous times in public forums; i.e., you expect a 0.7, but depending only on the label of the first example, you could get 0.3 or 0.7.
Hi Yuan,
A few comments:
* If you follow the mPower clinical study, it is fraught with variation and changes. I do have a conflict-of-interest concern with the founder's relationship with Apple. With that being said, I do not believe the organizers involved with this project are doing anything nefarious. I believe they are doing the best they can with the limited resources they have.
* If you review the second set of data, two wrist sensors with clearly defined activities, the clinical design is much better defined, and as a result there are more things that can be done theoretically.
* I am participating because this will benefit PD and MJFF in the long run. Maybe they are looking to lift our code, but at the end of the day, if it benefits the PD research community, I am supportive.
* You appear to have much more machine-learning talent than I have. I am interested in what I believe are some theoretical-implication questions: can I have someone walk around with a phone and assess their status? Being PD/non-PD is not really meaningful, but understanding whether a PD patient is "on" or "off" would have some benefit for PD care.
But @MonteShaffer, thanks all the same for your explanation. Did you play with changing
`trainoutcome <- factor(ifelse(trainoutcome, "PD", "Control"))`
to
`trainoutcome <- factor(ifelse(trainoutcome, "Control", "PD"))`?
Does it give you the reversed result? I found in my experiments that **what is treated as negative is determined by the first example that is read in, rather than by what is labeled as negative**. It seems to be a well-known issue with caret, mentioned in several threads, e.g. https://github.com/zachmayer/caretEnsemble/issues/228.
Does it work as expected for you?
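One way to take the level-ordering question out of play is to pin the outcome's factor levels explicitly rather than relying on whatever `factor()` infers, since caret's `twoClassSummary` treats the first factor level as the event of interest. A minimal sketch, with `trainfeat` (a data frame of submitted features) and a 0/1 `trainoutcome` as hypothetical stand-ins for the objects inside the scoring code:
```
library(caret)

# Spell out the level order: with twoClassSummary, the FIRST level is the
# event of interest (the class that Sens refers to), so make it explicit.
trainoutcome <- factor(ifelse(trainoutcome == 1, "PD", "Control"),
                       levels = c("PD", "Control"))

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

fit <- train(x = trainfeat, y = trainoutcome,
             method = "glm", metric = "ROC", trControl = ctrl)
fit$results[, c("ROC", "Sens", "Spec")]
```
With the levels pinned, which class counts as the event no longer depends on how the outcome vector happens to be constructed.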
Yuan,
To your point: yes, it seems bizarre that we would learn and train on a dataset that is not representative of the testing dataset. Why would we analyze motion data from a healthy 20-year-old male in the training set when this will not be used in the testing phase?
I concur that this is an inconsistent design.
Hi Yuan,
If you look at my examples, you can see that I have tried to identify both the baseline and the gold standard.
Solly suggests that 0.56 is the baseline. I created one feature using `rnorm` and got 0.595, so I would concur with this range as the lower bound.
My gold standard is based on medTimepoint:
```
> trainme = mPower$walking.training;
> trainme$healthState = NA;
> trainme$healthState[mPower$walking.training$medTimepoint=="Immediately before Parkinson medication"] = 0;
> trainme$healthState[mPower$walking.training$medTimepoint=="Another time"] = 1;
> trainme$healthState[mPower$walking.training$medTimepoint=="Just after Parkinson medication (at your best)"] = 2;
> trainme$healthState[mPower$walking.training$medTimepoint==""] = 3;
> trainme$healthState[is.na(mPower$walking.training$medTimepoint)] = 3;
> trainme$healthState[mPower$walking.training$medTimepoint=="I don't take Parkinson medications"] = 5;
> submitme = trainme[,c(1,15)];  # keep recordId (col 1) and the new healthState column (col 15)
```
So if I could perfectly predict levels of PD, I could at best get to 0.917 - to me, this is the target.
I could try to do my own random forest predictions on my raw motion-data features and just submit a single feature into the Synapse "algorithm" box, attempting to get as close to 0.917 as possible. This is my optimizing thinking.
I believe the goal of this challenge is not custom prediction algorithms but rather custom motion-detection feature algorithms, so I will be focusing on preparing motion features for the Synapse box with a target of 0.917 as the maximum.
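A hedged sketch of the "predict internally, submit one feature" idea described above, using randomForest purely as an example learner; `myfeatures` (one row of motion features per recordId) and a clean 0/1 `trainme$isPD` label are assumptions, not challenge code:
```
library(randomForest)

# Train a quick forest on my own motion features against the 0/1 label, then
# submit the out-of-bag PD score as a single engineered feature.
rf <- randomForest(x = myfeatures,
                   y = factor(trainme$isPD, levels = c(0, 1)),
                   ntree = 500)

submitme <- data.frame(
  recordId = trainme$recordId,
  rfScore  = rf$votes[, "1"]   # out-of-bag fraction of trees voting for PD (level "1")
)
resultme <- PD_score_challenge1(submitme)
str(resultme$error)
```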
Here is a 0/1 run:
```
trainme$isPD = NA;
trainme$isPD[mPower$walking.training$medTimepoint=="Immediately before Parkinson medication"] = 1;
trainme$isPD[mPower$walking.training$medTimepoint=="Another time"] = 1;
trainme$isPD[mPower$walking.training$medTimepoint=="Just after Parkinson medication (at your best)"] = 1;
trainme$isPD[mPower$walking.training$medTimepoint==""] = 0.5;
trainme$isPD[is.na(mPower$walking.training$medTimepoint)] = 0.5;
trainme$isPD[mPower$walking.training$medTimepoint=="I don't take Parkinson medications"] = 0;
submitme = trainme[,c(1,16)];
resultme = PD_score_challenge1(submitme);
str(resultme$error);
'data.frame': 1 obs. of 7 variables:
$ parameter: Factor w/ 1 level "none": 1
$ ROC : num 0.917
$ Sens : num 0.988
$ Spec : num 0.826
$ ROCSD : num 0.00503
$ SensSD : num 0.00433
$ SpecSD : num 0.00728
```
Similar scores. This target aligns with the research articles you found. An AUC of 0.6-0.8 would be a "good score" in my opinion. Remember, the testing data does not have medTimepoint ...
```
> str(mPower$walking.testing$medTimepoint)
NULL
```
So you could try to predict medTimepoint internally using your motion features and submit a single feature (which I do not believe is really what they want), or you can use your multiple motion features and submit, expecting to get a result lower than 0.917. This helps establish the "gold standard" target, IMO.
OK, it looks like we have a disagreement regarding the motivation behind all this. You could be right that I was biased.
So what do other participants think regarding the recent changes and information updates? I welcome feedback from other participants.
Did it surprise you? For example, only using two categories of 400 individuals for training instead of 2800 individuals?
As I mentioned, Yuanfang, you have access to the exact list of recordIds being used to build the model, as well as the code for this purpose, so you should have all the information you need. If there's something else I can clarify, please let me know.
Solly, I could not find it written anywhere in the webinar slides or the wiki. You said it is there? Which page? Can you point it out?
You said I have the code showing which ages were filtered and which samples were filtered? Can you point out which code and which line?
The code and the IDs were released 11 days before the deadline, without a line of age filtering or any other kind of filtering.
I have asked why it is only 400, three times since four days ago; only now, five days before the deadline, was I told that the training set I was using was not correct.
I think it is intentionally done to exclude some participants. @gustavo
Yuanfang- It was in the webinar. Additionally, you have access to the exact list of filtered samples from the training data, and the code, so you can see the models generated from your features.
Solly, where is it written?
That individual is 22 years old. We have set the lower age limit to 57. Otherwise, yes, age would account for the vast majority of the variance between cases and controls.
For example, can you tell me why this individual is gone?
"10994" "5" "23312407-13f9-4aaa-89bf-54b7ce36bd3c" "022c800e-f8e7-4a9e-ba35-d6eb27ffa549" "1427314426000" "version 1.0, build 7" "iPhone 6" "2908367" "2908382" "2908398" "2908413" "2908427" "2908443" "2908459" "2908476" "I don't take Parkinson medications"
In the corresponding demographics:
"2195" "2" "e7adcf85-4f74-426a-8ff7-d9f7a3c31462" "022c800e-f8e7-4a9e-ba35-d6eb27ffa549" "1427221563000" "version 1.0, build 7" "iPhone 6" "22" "false" "false" "2-year college degree" "Employment for wages" "Male" "true" "Single, never married" "true" "false" "false" "true" "false" """Black or African""" "Very easy" "false" "false" "true"
Are you sure?...
Age has been 0.8-0.9, no matter whether evaluated at the record level or the patient level, since July...
but Monte also got a 0.92... using your code.
Can you tell me how to go from 2500 to the 400 you got?
This is what I did:
grep don Walking_activity_training.tsv > aaa.txt
grep before Walking_activity_training.tsv > bbb.txt
cat aaa.txt bbb.txt | cut -f 4 | sort | uniq | wc
which gives me 2556 individuals instead of 400, where column 4 is the healthCode:
cut -f 4 Walking_activity_training.tsv | head
"healthCode"
"639e8a78-3631-4231-bda1-c911c1b169e5"
"639e8a78-3631-4231-bda1-c911c1b169e5"
Yuanfang-
Age + gender give us an AUROC of about 0.56 in the test data, so I suspect you're doing something wrong (e.g. not filtering or summarizing your samples as we are).
Solly, actually I just started to do some literature search, which I should have done three months ago:
https://psb.stanford.edu/psb-online/proceedings/psb16/bellon.pdf
gives an AUC of 0.65;
http://www.mdsabstracts.org/abstract/mpower-a-smartphone-based-study-of-parkinsons-disease-provides-personalized-measures-of-disease-impact/
gives 0.88 for gait and 0.85 for turning;
https://arxiv.org/pdf/1706.09574.pdf
from the organizing team gives ~0.6-0.7 when using the kind of split used in the challenge (i.e., no contamination between patients in training and test).
So I think very few, or hopefully no one, could provide features that beat the 0.92 baseline. Would the organizers consider taking out age/gender as combined features? It was great that no demographics were released for the test set; that could actually make this possible. @MonteShaffer
I think you are only using age and gender, which gives you the 0.918 AUC,
because: 34631 obs. of 2 variables.
Actually, I have another question regarding this: has it been observed that adding some extracted feature can beat 0.92, for example in your dozens of pilot studies? Maybe my approach is completely off; actually, I can't yet beat the age-only baseline. So my question is: if no one can beat the age-only baseline, would it make more sense to use only the submitted features for evaluation? Of course, that assumes my approach is not completely off and simply cannot beat 0.92.
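As a rough way to see what a submitted feature does entirely on its own (no age, no gender), one could compute its standalone AUC. The `perPatient` table below, with one row per healthCode, a 0/1 `isPD` label, and a summarized `feature` value, is hypothetical, and pROC is just one convenient package for this, not the challenge scorer:
```
library(pROC)

# AUROC of the summarized feature alone, with no covariate model at all.
single_auc <- auc(roc(perPatient$isPD, perPatient$feature))
single_auc
```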
So this is my question: are there only 178 + 275 valid training participant IDs? I don't know why I get 2500 of them.
OK, we have not had this problem in our testing, but if you come up with another example that fails, please send me the features used so I can try to reproduce the issue.
I am using your R code.
One feature, healthState, succeeded:
```
> str(submitme)
'data.frame': 34631 obs. of 2 variables:
$ recordId : chr "704fda87-91c7-4c67-b520-b8e189f7f7f7" "eac28a30-63fd-445e-b109-1f948170033f" "e8d73f87-1c77-4bde-99f7-f2a0ec3c4350" "a88ca2fb-51d9-400f-ba7c-873725aac8b3" ...
$ healthState: num 5 5 5 5 5 5 5 5 5 5 ...
> resultme = PD_score_challenge1(submitme);
[1] "Reading and merging covariates"
[1] "Summarizing the data"
[1] "Fitting the model"
[1] "Run Control"
[1] 453 3
[1] 453
trainoutcome
Control PD
178 275
[1] "Caret Call"
note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
[1] "Ensemble run"
> str(resultme$error)
'data.frame': 1 obs. of 7 variables:
$ parameter: Factor w/ 1 level "none": 1
$ ROC : num 0.917
$ Sens : num 0.988
$ Spec : num 0.826
$ ROCSD : num 0.005
$ SensSD : num 0.00386
$ SpecSD : num 0.00726
```
One feature, rnorm ... it seemed to work this time.
```
> nObs = dim(trainme)[1];
> submitme$healthState = rnorm(nObs);
> str(submitme)
'data.frame': 34631 obs. of 2 variables:
$ recordId : chr "704fda87-91c7-4c67-b520-b8e189f7f7f7" "eac28a30-63fd-445e-b109-1f948170033f" "e8d73f87-1c77-4bde-99f7-f2a0ec3c4350" "a88ca2fb-51d9-400f-ba7c-873725aac8b3" ...
$ healthState: num 1.365 -0.247 -0.535 -1.209 -0.465 ...
> resultme = PD_score_challenge1(submitme);
[1] "Reading and merging covariates"
[1] "Summarizing the data"
[1] "Fitting the model"
[1] "Run Control"
[1] 453 3
[1] 453
trainoutcome
Control PD
178 275
[1] "Caret Call"
note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
maximum number of iterations reached 0.0037724319275848073 -0.0037434011914565524 [... many repeated "maximum number of iterations reached" warnings truncated ...]
> str(resultme$error)
'data.frame': 1 obs. of 7 variables:
$ parameter: Factor w/ 1 level "none": 1
$ ROC : num 0.595
$ Sens : num 0.0806
$ Spec : num 0.938
$ ROCSD : num 0.00727
$ SensSD : num 0.0211
$ SpecSD : num 0.0167
>
```
That is very strange. We've run it with dozens of features, on both my laptop and a server, and not had a single problem. What kind of machine are you using? Could you send me a set of features that fails for you? sieberts [at] synapse [dot] org? Thanks!
I ran it successfully with one "test" feature (my medtime variable).
I then ran it unsuccessfully with one "test" feature (an rnorm variable).
I also tried it with three rnorm test features (same error).
I have lots of features to try, but wanted to benchmark "gold" and "random" as upper and lower bounds.
How many features are you trying to use, @MonteShaffer?
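For reference, the three-random-feature submission described above takes only a few lines to reproduce; `trainme` is the walking training table used earlier in the thread, and the feature names are arbitrary:
```
# Build a submission with three pure-noise features to probe the lower bound
# (the run that, per the post above, hit the same allocation error).
set.seed(42)
nObs <- nrow(trainme)
submitme <- data.frame(
  recordId = trainme$recordId,
  f1 = rnorm(nObs),
  f2 = rnorm(nObs),
  f3 = rnorm(nObs)
)
resultme <- PD_score_challenge1(submitme)
```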