Hi,
As instructed at https://www.synapse.org/#!Synapse:syn8717496/wiki/466771, the submission has to use recordId as the sample identifier. However, each person has as many recordIds as they have recordings, so one person will be treated as several samples and will receive multiple prediction results. This seems unreasonable. Using healthCode instead of recordId would be more reasonable.
I tried the following code, and found all the recordId in the demo table can not be matched in the syn10233116 table. Is there anything wrong?
```
library(synapseClient)
synapseLogin()

# Pull recordId/healthCode pairs from the training and testing tables
training_syntable <- synTableQuery("SELECT * FROM syn10146552")
traininginfo <- training_syntable@values
testing_syntable <- synTableQuery("SELECT * FROM syn10733842")
testinginfo <- testing_syntable@values
info_header <- c("recordId", "healthCode")
local <- rbind(traininginfo[, info_header], testinginfo[, info_header])

# Download syn10233116 and read the file into a data frame
synid <- "syn10233116"
syndemos <- synGet(synid)
demos <- read.csv(attributes(syndemos)$filePath, header = TRUE, as.is = TRUE)
remote <- demos[, c("recordId.walktest", "healthCode")]

# Count local recordIds without and with a match in syn10233116
match_recordID.i <- match(local$recordId, remote$recordId.walktest)
sum(is.na(match_recordID.i))   # unmatched
sum(!is.na(match_recordID.i))  # matched
```
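A match can look very different depending on which table is treated as the reference, so it may help to check both directions. The sketch below uses small made-up data frames standing in for `local` and `remote` (the IDs are hypothetical, not real Synapse records) and lists the recordIds found on one side only.

```
# Toy stand-ins for `local` and `remote`; all IDs here are made up.
local <- data.frame(
  recordId   = c("r1", "r2", "r3", "r4"),
  healthCode = c("h1", "h1", "h2", "h3"),
  stringsAsFactors = FALSE
)
remote <- data.frame(
  recordId.walktest = c("r2", "r3", "r9"),
  healthCode        = c("h1", "h2", "h7"),
  stringsAsFactors = FALSE
)

# Direction 1: which local recordIds appear in the remote table?
local_in_remote <- local$recordId %in% remote$recordId.walktest
# Direction 2: which remote recordIds appear in the local tables?
remote_in_local <- remote$recordId.walktest %in% local$recordId

table(local_in_remote)
table(remote_in_local)

# recordIds present on one side only
only_local  <- setdiff(local$recordId, remote$recordId.walktest)
only_remote <- setdiff(remote$recordId.walktest, local$recordId)
```

If `only_remote` is empty while `only_local` is not, the remote table is a strict subset of the local records, which would be consistent with a filtered scoring table.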
Created by Xinyu Zhang starrcofly
@starrcofly -
This issue was discussed during the webinar. The supplementary training data are provided so you have access to later appVersions. Later appVersions (1.3 and beyond) have been filtered for the purposes of the primary scoring model due to differences in the way data were collected between versions <=1.2 and versions >= 1.3. However, these later appVersion observations will be used in the event a tie-breaker is necessary. I hope that clarifies the situation.
Solly
All the recordIds in table syn10233116 can be found only in the first batch of training data, but none in the supplementary training data.
> We found all the 6000 more recordId were mapped only in first training set,
what do you mean?
> all the recordId in the demo table can not be matched in the syn10233116 table.
what do you mean?
thanks
Please read [this page](https://www.synapse.org/#!Synapse:syn8717496/wiki/466771) for submission requirements and to find the submission template. Your submission will fail if you don't include all recordIds in the template. Supplementary samples will be used to fit/score the tie-breaker models, if necessary, so it is in your best interest to do a good job on all samples.
Including additional recordIds will not cause an error, but duplicate Ids or missing Ids will.
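These rules can be checked locally before submitting. The following is only a rough sketch, not the challenge's own validation code: `check_submission` is a hypothetical helper, the column name `recordId` in the template is an assumption, and the data frames are made up.

```
# Hypothetical helper: compares a submission's recordIds with the template's.
check_submission <- function(submission, template) {
  ids <- submission$recordId
  list(
    duplicated = unique(ids[duplicated(ids)]),      # would cause an error
    missing    = setdiff(template$recordId, ids),   # would cause an error
    extra      = setdiff(ids, template$recordId)    # allowed, per the reply above
  )
}

# Made-up template and submission for illustration
template <- data.frame(recordId = c("r1", "r2", "r3"), stringsAsFactors = FALSE)
submission <- data.frame(
  recordId = c("r1", "r2", "r2", "r5"),
  feature1 = c(0.1, 0.4, 0.4, 0.9),
  stringsAsFactors = FALSE
)

result <- check_submission(submission, template)
```

A submission would be ready only when both `result$duplicated` and `result$missing` are empty.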
Regarding question 2, the answer is no. For primary scoring we have filtered the samples in an analogous manner, but additional samples will be used as a tie-breaker as necessary.
Solly
Hi Solveig,
Thank you for the reply. We found all the 6000 more recordId were mapped only in first training set, but none in supplementary training. Does this mean we should not submit features for any supplementary data?
Question two: for features of the testing set, is there a table similar to syn10233116 that limits the recordIds for submission?
Question three: suppose we submit recordIds beyond those in table syn10233116. Will the scoring system ignore them, or will it stop because of NA values produced by unmapped recordIds?
Thanks.
@starrcofly -
If you examine the code used to fit models on your features, you will find a step where we summarize across recordIds for each healthCode. This detail was announced in the challenge webinar.
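As a rough illustration of what summarizing across recordIds for each healthCode could look like, here is a minimal sketch on made-up data. The median is an assumption chosen for the example; the thread does not say which summary statistic the scoring code uses, and the feature names are hypothetical.

```
# Made-up feature table: several recordIds per healthCode
features <- data.frame(
  recordId   = c("r1", "r2", "r3", "r4"),
  healthCode = c("h1", "h1", "h2", "h2"),
  feature1   = c(1, 3, 10, 20),
  feature2   = c(0.2, 0.4, 0.5, 0.7),
  stringsAsFactors = FALSE
)

# Collapse to one row per healthCode (median is an assumed choice)
per_person <- aggregate(
  cbind(feature1, feature2) ~ healthCode,
  data = features,
  FUN = median
)
```

After this step each person contributes a single row, so multiple recordIds per person do not become multiple samples in the model.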
There is nothing wrong with syn10233116. For the purposes of scoring, we have selected a subset of the healthCodes based on age, and a subset of recordIds based on medTimepoint. The remaining data are being collected so that we can use them for additional and exploratory analyses in the Collaborative Phase of the challenge.
Solly