Why is time-series data, as an unbalanced panel, stored as json objects?
e.g.,
```
user-id activity-id timestamp attitude-x attitude-y attitude-z attitude-w rotation-x rotation-y rotation-z acceleration-x acceleration-y acceleration-z gravity-x gravity-y gravity-z magnetic-x magnetic-y magnetic-z magnetic-accuracy
```
The function synDownloadTableColumns downloads files in bulk in tar.gz format. These files are then unpacked into the cache once downloaded.
**(2) Am I correct that there are about this many json files?**
That is a good estimate of the upper bound.
**(3) Am I correct that the data consists of 2,864 unique health codes? This means "unique" users?**
I believe it is 2,834 unique users obtained with:
```
SELECT count(distinct healthCode) FROM syn10146553
```
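In R, with the synapseClient package, that count can be reproduced roughly as follows (a sketch; the `@values` slot is how that client version exposed query results, so adjust for your client version):

```
require(synapseClient)
synapseLogin()
## Count distinct users directly against the walking table.
res <- synTableQuery("SELECT count(distinct healthCode) FROM syn10146553")
res@values
```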
**(4 and 5) So how do I bulk download the json files? You have some sort of limits in your SQL approach, which I don't need since you haven't parsed the data. When I tried to do "10 at a time with an SQL offset" ... I got this notice:**
The files are indeed downloaded in batch when you call synDownloadTableColumns. You do not need to use limit and offset to set your batch sizes.
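For example, something along these lines (a sketch, not verbatim challenge code; syn10146553 is the walking table referenced above) pulls every referenced file in one call:

```
require(synapseClient)
synapseLogin()
## One table query; no LIMIT/OFFSET needed, the client batches the download.
wk <- synTableQuery("SELECT * FROM syn10146553")
## Retrieves every file handle in the column in bulk into the local cache;
## returns a named vector mapping file handle IDs to cached file paths.
map <- synDownloadTableColumns(wk, "accel_walking_outbound.json.items")
```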
I asked 5 questions and this is your response?
```
this data is not really tabular in nature.
```
Time-series data is inherently tabular or "panel" in nature: https://en.wikipedia.org/wiki/Panel_data
Here are the first few rows of a single "accel_walking_outbound" file:
```
> head(ftemp);
timestamp x y z
1 0.04992573 -0.014365210 0.012270284 -0.03202245
2 0.15976273 -0.066439096 -0.002094926 -0.03097498
3 0.26959491 0.086041622 0.047285483 -0.03217208
4 0.37942900 0.002543839 -0.003292027 0.06883330
5 0.48926000 0.050128597 0.017956512 -0.01301847
6 0.59910318 0.062398881 0.009876082 -0.02289455
```
Looks pretty tabular to me. The nature of the data is unbalanced, so it is relational, meaning all of the data cannot go into a single table; but it would be very possible for all of the json data to go into 2-3 "panel" tables, as I initially observed in this thread.
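To make that concrete, here is a rough sketch of how I would stack all of the accelerometer json files into one long panel table (untested; `map` stands for the filehandle-to-path mapping returned by synDownloadTableColumns in the code below):

```
require(rjson)
## Flatten each cached json file into (timestamp, x, y, z) rows,
## tag the rows with their file handle ID, and stack into one panel.
panel <- do.call(rbind, lapply(names(map), function(fh) {
  recs <- fromJSON(file = map[[fh]])
  df <- do.call(rbind, lapply(recs, function(r) {
    data.frame(timestamp = r$timestamp, x = r$x, y = r$y, z = r$z)
  }))
  df$fileHandleId <- fh
  df
}))
```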
The example code in your response only shows getting "device_outbound". The example code below shows how to possibly grab it all, but it may take 97 hours. How do I download the bulk json data? If I use SQL, the data is sent uncompressed every time. If I could get a tar.gz of the json files instead, I could extract it into your arbitrary caching system's cache folder and run.
I have been reading your code in mPower and am trying to think for myself about what characteristics I can extract from this raw data. Kurtosis is a statistical artifact but, IMO, not a characteristic. I will review how you process the data and determine medians in your mPower library, but at the end of the day, how I deal with signal processing and data organization will be based on my intuition.
If I get one answer from you, it is: **how do I download the bulk json data?** The SQL approach, at its best according to GitHub, will take me 97 hours. If this were 1997, that would be fine. But twenty years later, that is not cool.
```
require(synapseClient)
require(rjson)
synapseLogin()
synapseCacheDir("/tmp/.synapseCache")
## QUERY THE mPower PROJECT (syn4993293) FOR ALL OF THE TABLES
q <- synQuery('SELECT id, name FROM table WHERE parentId=="syn4993293"')
## SELECT ONLY THE SAMPLE TABLES, WHICH ARE PUBLICLY AVAILABLE
q <- q[grep("Activity", q$table.name), ]
## READ IN THE SAMPLE DATA FROM EACH OF THE TABLES (RESULTS ARE CACHED LOCALLY)
## Corresponding to the walking, voice, tapping and memory tables on the Synapse project website,
## under four separate folders; each folder contains one csv file,
## not individual json files
allDat <- lapply(as.list(q$table.id), function(x){
synTableQuery(paste0('SELECT * FROM ', x))
})
names(allDat) <- q$table.name
## walking activity data
map <- synDownloadTableColumns(allDat$`Walking Activity`,
                               c("accel_walking_outbound.json.items",
                                 "deviceMotion_walking_outbound.json.items",
                                 "pedometer_walking_outbound.json.items",
                                 "accel_walking_return.json.items",
                                 "deviceMotion_walking_return.json.items",
                                 "pedometer_walking_return.json.items",
                                 "accel_walking_rest.json.items",
                                 "deviceMotion_walking_rest.json.items"))
## THE NAMES OF map ARE THE FILEHANDLES STORED IN THE COLUMN, SO THEY CAN BE ASSOCIATED WITH THE APPROPRIATE METADATA
## THESE ARE JSON FILES, SO READ THEM INTO MEMORY
results <- lapply(as.list(map), function(x){
fromJSON(file=x)
})
```
https://sagebionetworks.jira.com/browse/SYNR-1024
The entire download took 97 hours. Though longer than the estimated time, it ran smoothly without a single error, a huge improvement over our previous test. I am pretty happy with it.
Anyway, I can play around with toy examples to do signal analysis, but for the challenge, I will need all of the training data.
**how do I download the bulk json data?** Preferably in a .tar.gz format [https://www.howtogeek.com/248780/how-to-compress-and-extract-files-using-the-tar-command-on-linux/]. For a big-data download, is this an unreasonable request? It seems odd that MJFF, the pillar of accessibility to data, would intentionally make this so difficult.
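In the meantime, the closest workaround I can see is to archive the cache myself after one complete download (a sketch using base R's utils::tar; the path matches the synapseCacheDir call above):

```
## Roll my own tar.gz of the cached json files.
tar("mpower_json.tar.gz", files = "/tmp/.synapseCache", compression = "gzip")
```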
@MonteShaffer, this data is not really tabular in nature. You will have much more success using the R synapseClient, as described in the [wiki](#!Synapse:syn8717496/wiki/448349). Or just use the code that downloads all files and creates toy features as a starting point:
https://github.com/Sage-Bionetworks/PDBiomarkerChallenge/blob/master/simplified_walk.py
Best,
Larsson
I am a bit baffled by this entire setup.
(1) Is this a challenge to perform analyses or parse data?
I downloaded the main tables via the web, then imported them into R:
```
> str(walking)
'data.frame': 34631 obs. of 16 variables:
$ ROW_ID : int 0 1 2 3 4 5 6 7 8 9 ...
$ ROW_VERSION : int 0 0 0 0 0 0 0 0 0 0 ...
$ recordId : Factor w/ 34631 levels "00028e9c-adfb-4d0f-b669-ac02e5ca4cf6",..: 15158 31699 31450 22732 14006 20918 25685 29747 30900 11489 ...
$ healthCode : Factor w/ 2834 levels "000240d1-1110-4dd2-a2d0-e344c37efd68",..: 1129 1129 1129 1173 1889 290 580 1889 1889 290 ...
$ createdOn : num 1.43e+12 1.43e+12 1.43e+12 1.43e+12 1.43e+12 ...
$ appVersion : Factor w/ 3 levels "version 1.0, build 7",..: 1 1 1 1 1 1 1 1 1 1 ...
$ phoneInfo : Factor w/ 9 levels "iPhone 4S","iPhone 5 (GSM)",..: 7 7 7 8 8 5 7 8 8 5 ...
$ accel_walking_outbound.json.items : int 2324645 2324168 2326183 2341901 2354841 2521131 2367730 2369834 2324475 2516040 ...
$ deviceMotion_walking_outbound.json.items: int 2324646 2324169 2326184 2341902 2354842 2521145 2367731 2369835 2324476 2516058 ...
$ pedometer_walking_outbound.json.items : int 2324647 2324170 2326185 2341903 2354843 2521160 2367732 2369836 2324477 2516067 ...
$ accel_walking_return.json.items : int 2324648 2324171 2326186 2341904 2354844 2521176 2367733 2369837 2324478 2516086 ...
$ deviceMotion_walking_return.json.items : int 2324649 2324172 2326187 2341905 2354846 2521190 2367734 2369838 2324479 2516103 ...
$ pedometer_walking_return.json.items : int 2324650 2324173 2326188 2341906 2354847 2521212 2367735 2369839 2324480 2516118 ...
$ accel_walking_rest.json.items : int 2324651 2324174 2326190 2341908 2354848 2521227 2367736 2369840 2324481 2516133 ...
$ deviceMotion_walking_rest.json.items : int 2324652 2324175 2326191 2341909 2354849 2521242 2367737 2369841 2324482 2516148 ...
$ medTimepoint : Factor w/ 5 levels "","Another time",..: 3 3 3 3 3 3 3 3 3 3 ...
> str(demographics);
'data.frame': 2864 obs. of 33 variables:
$ ROW_ID : int 0 1 2 3 4 5 6 7 8 9 ...
$ ROW_VERSION : int 0 0 0 0 0 0 0 0 0 0 ...
$ recordId : Factor w/ 2864 levels "00304e29-11e3-4293-9e70-3abdeb024917",..: 2121 64 2018 158 1240 2013 1072 1409 583 1464 ...
$ healthCode : Factor w/ 2864 levels "000240d1-1110-4dd2-a2d0-e344c37efd68",..: 1141 941 1185 1973 587 731 847 295 1962 2320 ...
$ createdOn : num 1.43e+12 1.43e+12 1.43e+12 1.43e+12 1.43e+12 ...
$ appVersion : Factor w/ 3 levels "version 1.0, build 7",..: 1 1 1 1 1 1 1 1 1 1 ...
$ phoneInfo : Factor w/ 9 levels "iPhone 4S","iPhone 5 (GSM)",..: 7 5 8 7 7 8 7 5 8 7 ...
$ age : int 24 38 32 24 35 22 26 37 38 43 ...
$ are.caretaker : Factor w/ 3 levels "","false","true": 2 2 2 2 2 2 2 2 2 2 ...
$ deep.brain.stimulation : Factor w/ 3 levels "","false","true": 2 2 2 2 2 2 2 2 1 2 ...
$ diagnosis.year : int NA NA NA NA NA NA NA NA NA NA ...
$ education : Factor w/ 9 levels "","2-year college degree",..: 7 8 2 7 4 3 4 3 5 7 ...
$ employment : Factor w/ 8 levels "","A homemaker",..: 4 4 3 4 4 3 3 4 4 4 ...
$ gender : Factor w/ 4 levels "","Female","Male",..: 3 2 3 3 3 3 3 3 3 3 ...
$ health.history : Factor w/ 605 levels "","\"Acute Myocardial Infarction/Heart Attack\"",..: 258 417 1 84 1 1 1 1 1 588 ...
$ healthcare.provider : Factor w/ 7 levels "","Don't know",..: 2 1 1 5 1 1 1 1 1 1 ...
$ home.usage : Factor w/ 3 levels "","false","true": 3 3 3 3 3 3 3 3 3 3 ...
$ last.smoked : int NA NA 2009 2014 NA 2015 NA NA NA NA ...
$ maritalStatus : Factor w/ 7 levels "","Divorced",..: 6 3 6 3 3 6 6 3 3 3 ...
$ medical.usage : Factor w/ 3 levels "","false","true": 3 3 3 3 3 3 3 3 3 3 ...
$ medical.usage.yesterday: Factor w/ 4 levels "","dont-know",..: 2 3 3 3 4 3 4 3 3 4 ...
$ medication.start.year : int NA NA NA NA NA 0 NA 0 NA NA ...
$ onset.year : int NA NA NA NA NA NA NA NA NA NA ...
$ packs.per.day : int NA NA 1 NA NA 1 NA NA NA NA ...
$ past.participation : Factor w/ 3 levels "","false","true": 2 2 2 2 2 2 2 2 2 2 ...
$ phone.usage : Factor w/ 4 levels "","false","Not sure",..: 4 4 4 4 4 4 4 4 4 4 ...
$ professional.diagnosis : Factor w/ 3 levels "","false","true": 2 2 2 2 2 2 2 2 2 2 ...
$ race : Factor w/ 52 levels "","\"Black or African\"",..: 46 46 46 18 46 46 46 46 18 46 ...
$ smartphone : Factor w/ 6 levels "","Difficult",..: 6 6 6 6 6 6 6 6 6 6 ...
$ smoked : Factor w/ 3 levels "","false","true": 2 2 3 3 2 3 2 2 2 2 ...
$ surgery : Factor w/ 3 levels "","false","true": 2 1 1 2 2 2 2 1 1 2 ...
$ video.usage : Factor w/ 3 levels "","false","true": 2 3 3 3 3 3 3 3 3 3 ...
$ years.smoking : int NA NA 8 0 NA 3 NA NA NA NA ...
>
```
From this, according to my understanding, there are 8 types of json files per "recordId":
```
jsonlist = c("accel_walking_outbound.json.items",
             "deviceMotion_walking_outbound.json.items",
             "pedometer_walking_outbound.json.items",
             "accel_walking_return.json.items",
             "deviceMotion_walking_return.json.items",
             "pedometer_walking_return.json.items",
             "accel_walking_rest.json.items",
             "deviceMotion_walking_rest.json.items");
```
If I am correct, that means there are, at most, 277,048 json files (8 * 34631).
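The exact count can be checked from the table itself, since a skipped activity leaves an NA file handle (using the `walking` data frame and `jsonlist` above):

```
## Each non-NA entry in a *.json.items column is one file handle,
## i.e. one json file; NAs are activities the user skipped.
sum(!is.na(walking[, jsonlist]))  # at most 8 * 34631 = 277048
```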
(2) Am I correct that there are about this many json files?
(3) Am I correct that the data consists of 2,864 unique health codes? This means "unique" users?
(4) So how do I bulk download the json files? You have some sort of limits in your SQL approach, which I don't need since you haven't parsed the data. When I tried to do "10 at a time with an SQL offset" ... I got this notice:
```
10 total files requested: At least 0 are downloaded; 10 will be retrieved in the next batch.
............................................................
```
YES, I have read your limited data dictionary.
I have participated in previous Challenges, including the original Netflix challenge. It is pretty normal for the "raw data" to be tabularized.
I am capable of tabularizing it myself, if I can bulk download the 277,048 json files.
(5) Is this possible to bulk download? How?
If you have an interest in improving your data accessibility, you may want to see PatentsView:
http://www.patentsview.org/api/doc.html
http://www.patentsview.org/query/
http://www.patentsview.org/download/
Sure, I get that you need to secure the data. Sure, I get that I can work on a small sample to get my signal-processing elements in place. But no, it is not cool to make parsing a prerequisite to the analysis. IMO, the current situation is a step backward from the former Kaggle solution.
This is an artifact of how data is collected and stored by Apple ResearchKit.
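Concretely, each element of a deviceMotion_walking_*.json file serializes one CoreMotion sample, which is why the flattened schema at the top of this thread has attitude/rotation/acceleration/gravity/magnetic columns. Here is a sketch of flattening one record into that wide row (the nested field names are my assumption, based on ResearchKit's CMDeviceMotion serialization; verify against an actual file):

```
require(rjson)
## Flatten one deviceMotion record into a single wide row.
flattenDeviceMotion <- function(rec) {
  data.frame(
    timestamp         = rec$timestamp,
    attitude.x        = rec$attitude$x,
    attitude.y        = rec$attitude$y,
    attitude.z        = rec$attitude$z,
    attitude.w        = rec$attitude$w,
    rotation.x        = rec$rotationRate$x,
    rotation.y        = rec$rotationRate$y,
    rotation.z        = rec$rotationRate$z,
    acceleration.x    = rec$userAcceleration$x,
    acceleration.y    = rec$userAcceleration$y,
    acceleration.z    = rec$userAcceleration$z,
    gravity.x         = rec$gravity$x,
    gravity.y         = rec$gravity$y,
    gravity.z         = rec$gravity$z,
    magnetic.x        = rec$magneticField$x,
    magnetic.y        = rec$magneticField$y,
    magnetic.z        = rec$magneticField$z,
    magnetic.accuracy = rec$magneticField$accuracy
  )
}
```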