The PD_score_challenge1 function reports an unbalanced classification in the "training" phase:

```
Control      PD
    178     275
```

I earlier suggested the gold standard was about `0.91734137545722882` based on this setup. If you do some simple math:

```
nobs = 275 + 178;   # 453
tobs = 275;         # PD = true
tobs / nobs;        # probability
[1] 0.607064
```

So the baseline is this probability (0.607064, the majority-class rate) if we are using this function to improve our scores. I went off on a tangent trying to build my own random forests and ensembles, but at the end of the day this will be the scoring algorithm.

#1# Is that correct: the scoring will occur with the same ensemble, just with a different data set?

#2# Will the unbalanced classification be about the same, or will it change? I am not gaming this with my features, but it would be useful to know whether "unbalanced" will still be the modus operandi. The "Caret Call" step triggers a notice based on this assumption, and I want to make certain this isn't changing with the final scoring: `note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .`

On another topic, I moved my data to a non-Windows server to take advantage of multiple cores. I believe it improves overall prediction speed by about 150% over my single-core run; however, I did not notice any significant difference between assigning 16 or 22 (of 24) cores.
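For reference, here is a minimal sketch of how a parallel backend can be registered before the caret call; the doParallel backend and the 16-core count are illustrative assumptions on my part, not necessarily what the challenge's scoring code does internally:

```
# Minimal sketch (assumption: doParallel backend; the core count is illustrative).
library(doParallel)

cl = makeCluster(16);        # spin up 16 worker processes
registerDoParallel(cl);      # caret::train() picks up the registered backend

# ... run the scoring / training here, e.g. PD_score_challenge1(submitme) ...

stopCluster(cl);             # release the workers when done
```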

Created by Monte Shaffer (MonteShaffer)
I tried:

```
submitme = subset(submitme, isPD==1 | isPD==0);
table(submitme$isPD);

> table(submitme$isPD)

    0     1
12122 22435

resultme = PD_score_challenge1(submitme);
```

And got `0.91729675015029688`.
Hi @yuanfang.guan, I am treating their scoring algorithm like a black box. I take their raw data from `walking.training` and put it into their black box:

```
trainme = codeHealthState(mPower$walking.training);
> submitme = trainme[,c(1,16)];   # one variable, $isPD
> str(submitme)
'data.frame':	34631 obs. of  2 variables:
 $ recordId: chr  "704fda87-91c7-4c67-b520-b8e189f7f7f7" "eac28a30-63fd-445e-b109-1f948170033f" "e8d73f87-1c77-4bde-99f7-f2a0ec3c4350" "a88ca2fb-51d9-400f-ba7c-873725aac8b3" ...
 $ isPD    : num  0 0 0 0 0 0 0 0 0 0 ...
```

I pass this into the black box:

```
> resultme = PD_score_challenge1(submitme);
[1] "Reading and merging covariates"
[1] "Summarizing the data"
[1] "Fitting the model"
[1] "Run Control"
[1] 453   3
[1] 453
trainoutcome
Control      PD
    178     275
[1] "Caret Call"
note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
[1] "Ensemble run"
```

And note the ROC value:

```
print(resultme$error$ROC);   # 0.91734137545722882
```

As a participant, I may also be way off base. I took their ideal data, put it in the black box, and got this value. It is not 1, IMO, because of: the 74 or so NA values on medTime, the median collapsing of replicate records for a unique healthCode, the unbalanced ensemble classifications, and possibly other issues.

```
> table(submitme$isPD)

    0   0.5     1
12122    74 22435
```
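To make the "median collapsing" point concrete, here is a rough sketch of what I mean; the column names (`healthCode`, `medTime`) and the `features` data frame are illustrative and may not match what the scoring code actually uses internally:

```
# Rough sketch only: replicate walking records for the same healthCode are
# reduced to a single row by taking the median of the numeric summary.
# (Column and data-frame names here are assumptions for illustration.)
collapsed = aggregate(medTime ~ healthCode, data = features, FUN = median, na.rm = TRUE);
head(collapsed);
```

With aggregate()'s default `na.action`, rows whose medTime is NA are dropped before the median is taken.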
Sorry, I hadn't had my morning coffee. I, of course, meant to say "testing" throughout. (1) The models generated will be evaluated in the testing portion of the data. (2) We cannot reveal any information about the labels in the testing data.
Monte and organizer, I really don't understand your discussion; maybe I understand this challenge completely wrong.

> I earlier suggested the gold standard was about 0.91734137545722882 based on this setup.

Why do you think that, if you score the gold standard itself, you don't get a 1?

> The models generated will be evaluated in the training portion of the data.

Can you tell me, if it is evaluated on the training part, then why is there a test set?

> We cannot reveal any information about the labels in the training data.

Isn't the label just the professional diagnosis, which is available in the demographic files?

Thanks for the clarification.

yuanfang
Monte,

(1) The models generated will be evaluated in the training portion of the data.
(2) We cannot reveal any information about the labels in the training data.

Solly
