Hello,
I am trying to extract raw gene and isoform counts from the bulk RNAseq data in the AMP-AD Diverse Cohorts study from [Mayo + Emory](syn51735458). I want to process the data in the same way as was done for the [RNAseq Harmonization study](syn26720675). The README suggests that it is possible to use the same workflow by running it using `cwltool`.
I am trying to set up a job for processing the bam files in the AMP-AD Diverse Cohorts study, using the job used for [Mayo](https://github.com/titoghose/amp-workflows/tree/develop/amp-rnaseq_reprocessing/amp-rnaseq_reprocess-workflow/jobs/mayo) as part of the RNAseq harmonization project as a template. The [job.json](https://github.com/titoghose/amp-workflows/blob/develop/amp-rnaseq_reprocessing/amp-rnaseq_reprocess-workflow/jobs/rosmap/job.json) file contains an `index_synapseid` and a `synapse_parentid` that I don't seem to have read access to, so I have not been able to figure out what they are. I think the `index_synapseid` may be the genome reference and annotation files required by the alignment tools, but I am not sure. I understand that `synapseid` is an array of IDs of the bam files to be processed, but without understanding what the other two IDs are, I am unable to proceed.
Snippet of the job.json file copied from the public github repository:
```
{
"index_synapseid": "syn20645801",
"nthreads": 15,
"synapse_config": {
"class": "File",
"path": "/etc/synapse/.synapseConfig"
},
"synapse_parentid": "syn20825471",
"synapseid": [
"syn4898414",
"syn4899441",
......
```
I was wondering if someone can explain to me what these files/folders are and how I can modify the workflow to process the AMP-AD Diverse Cohorts data. I've also noticed that the [bam files for Mayo + Emory in AMP-AD Diverse Cohorts](syn51735458) are lane split (L1-L4). Does this mean that an additional step needs to be added somewhere in [wf-alignment.cwl](https://github.com/titoghose/amp-workflows/blob/develop/amp-rnaseq_reprocessing/amp-rnaseq_reprocess-workflow/wf-alignment.cwl) to merge them?
Best,
Upamanyu
Created by Upamanyu Ghose (ug96)

Hi @ug96,
Good catch. It looks like the RNASeq portion of the DivCo harmonization study hasn't officially been published on the portal yet, which is why you don't have access. We have run into some technical issues with the residual matrices; however, I will push to have all other matrices (raw and normalized counts, CQN matrices, etc.) published ASAP (hopefully in the next data release in a few weeks). The AMP 1.0 data might lag slightly behind the DivCo data, but I will also push to have at least the raw counts published as soon as possible. I will let you know when I have a more tangible time frame for that. I also want to note that for AMP 1.0, the WGS reprocessing will be released significantly later (most likely in ~6 months), as I haven't begun reprocessing those files yet.
Hope this helps!
Will

Hello,
I was wondering if it would be possible to gain access to the harmonised RNAseq data for DivCo and AMP 1.0? I had a look at the files and resources and noticed a project called DivCo\_HS ([syn68898203](syn68898203)), which contains the joint-call VCFs for the WGS data, but the harmonised RNAseq data (raw counts) is not available. The wiki for the DivCo\_HS project states:
"_..... In addition, we are working on an RNASeq harmonization that can also be uploaded to this study. The RNASeq harmonization will contribute raw RNA expression counts as well as Residual matrices and QC results ...._"
Since you mentioned in your previous message that the resource is now available at [https://www.synapse.org/Synapse:syn69051372](https://www.synapse.org/Synapse:syn69051372), I wanted to check whether it is just an issue on my end that I cannot access the data, or whether it is yet to be made public.
Thanks very much for your help.
Best,
Upamanyu
Hello @wpoehlm,
Thank you very much for your response! I do want to have the same RNAseq pipelines for DivCo and the RNAseq Harmonization project (which I am guessing is AMP 1.0) because I want to work on a combined cohort of the ROSMAP data from the RNAseq harmonization project and Mayo Clinic data from the DivCo project. My requirement is the called variants from WGS/SNP array and the raw counts from the RNAseq, so the resource you mention would be perfect.
I tried accessing the DivCo processed data using the link you shared, but I don't seem to have access to the resource. I was wondering if I need to apply separately to access this data. I have access to all other studies and projects under [syn2580853](syn2580853).
Additionally, having access to the AMP 1.0 and DivCo data processed with the same Nextflow pipeline would be perfect. Is there an estimated timeline for the next data release?
Thank you very much in advance!
Best,
Upamanyu

Hello @ug96,
Thanks for reaching out. I can clarify the following to help you get the workflow up and running:
- "index_synapseid" is indeed a folder that contains indexed reference genome files. We should ideally be sharing those publicly or providing documentation on how to generate them. In the meantime, I've added you to the syn20645801 folder, so you can access those now.
- "synapse_parentid" is the Synapse ID of the folder that you would like the results files to be written to (the workflow uploads them there automatically). I would recommend setting this to a private folder that you have access to.
- Unfortunately, I don't believe that this iteration of the workflow supports merging across lanes, so you are correct that either additional functionality would need to be added or the reads would need to be merged before running.
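To make the points above concrete, here is a rough Python sketch (not part of the workflow itself) of how the lane-split files might be grouped per sample before merging, and how a job file with the fields discussed above could be assembled. It rests on several assumptions: the file-name pattern `<sample>_L<1-4>.bam`, the output name `<sample>.merged.bam`, and the helper names `group_lanes`, `merge_command` and `build_job` are all hypothetical, and the Synapse IDs passed to `build_job` in the demo are placeholders. Only the keys visible in the job.json snippet are reproduced; the real file may contain more. Merged BAMs would presumably also need to be uploaded to Synapse before their IDs could be listed under `synapseid`.

```python
import json
import re
from collections import defaultdict

# Assumed naming scheme: "sample01_L1.bam" ... "sample01_L4.bam" (hypothetical).
LANE_PATTERN = re.compile(r"^(?P<sample>.+)_L(?P<lane>[1-4])\.bam$")


def group_lanes(bam_names):
    """Group lane-split BAM file names by sample, ordered by lane number."""
    groups = defaultdict(list)
    for name in bam_names:
        match = LANE_PATTERN.match(name)
        if match is None:
            raise ValueError(f"unexpected BAM name: {name}")
        groups[match["sample"]].append((int(match["lane"]), name))
    return {s: [n for _, n in sorted(lanes)] for s, lanes in groups.items()}


def merge_command(sample, lane_bams, threads=4):
    """Build (but do not run) a `samtools merge` call for one sample."""
    return ["samtools", "merge", "-@", str(threads),
            f"{sample}.merged.bam", *lane_bams]


def build_job(index_synapseid, synapse_parentid, synapseids, nthreads=15):
    """Assemble a job dict with the same keys as the job.json snippet."""
    return {
        "index_synapseid": index_synapseid,
        "nthreads": nthreads,
        "synapse_config": {"class": "File", "path": "/etc/synapse/.synapseConfig"},
        "synapse_parentid": synapse_parentid,
        "synapseid": list(synapseids),
    }


if __name__ == "__main__":
    lanes = group_lanes(["s1_L2.bam", "s1_L1.bam", "s2_L1.bam"])
    for sample, bams in lanes.items():
        print(" ".join(merge_command(sample, bams)))
    # "syn00000001" and "syn00000002" are placeholders, not real Synapse IDs.
    job = build_job("syn20645801", "syn00000001", ["syn00000002"])
    print(json.dumps(job, indent=2))
```

The commands are only constructed, not executed, so they can be inspected before anything is run; the resulting dict can be written out with `json.dump` and passed to `cwltool` alongside the workflow file.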
At a higher level, I want to note that we have recently reprocessed all of the AMP 1.0 and Diverse Cohorts RNASeq samples using a newer Nextflow pipeline. If your goal here is to harmonize the data between AMP 1.0 and DivCo, you likely don't need to reprocess anything (at least to get to gene counts). You can find the DivCo reprocessing here: https://www.synapse.org/Synapse:syn69051372. The (re)-reprocessed AMP 1.0 data is not yet published on the portal, but I can push to have it in the next data release.
Best,
Will