Hello!
I hope this message finds you well!
First, I'd like to thank you for your hard work on this site! It is very much appreciated! Secondly, thank you for the quick authorization!
I am writing with hopes that you may be able to provide me with some clarifications as I attempt to download H&E, clinical data, exome, and RNAseq for the MPNST samples in https://www.synapse.org/#!Synapse:syn4939902/wiki/235907 on behalf of the GeM consortium. Most importantly, I need to download and reprocess exome and RNAseq data.
I am writing to detail some inconsistencies I was hoping you may be able to sort out.
To start, I have attempted to download raw RNA-seq data.
First, I was hoping you could clarify the following issue with the metadata. In this table: https://www.synapse.org/#!Synapse:syn4939902/wiki/593716, the line: Malignant Peripheral Nerve Sheath Tumor false geneExpression 16 6, indicates 6 "samples" and 16 "patients". With human subjects, shouldn't the number of samples generally be greater than or equal to the number of patients? Would you be able to provide some insight on this issue?
**Additional issues I encountered but have been able to circumvent:**
1. Navigating to the "file view" mentioned in https://www.synapse.org/#!Synapse:syn4939902/wiki/593715, I only see 493 RNA-seq data files. However, navigating directly to https://www.synapse.org/#!Synapse:syn13363852/, there are 667 RNA-seq data files. Would you be able to clarify why you have chosen this view? I personally found this confusing.
2. When attempting to download raw RNAseq data, I clicked "Add to Download List" under "Download Options" in https://www.synapse.org/#!Synapse:syn13363874 and then tried to download the files by clicking "Click to view items in your download list". I encountered a notice under "Access" that reads "You must request access to this restricted file". Is there a reason I am not able to access these files via this view, but am presumably able to download them via an alternative view?
3. Finally, please note that the link in this statement: "To view the data files, please navigate to the file view that lists all the files and their associated metadata." redirects to the Synapse home page.
Thanks for your help,
Alon
Created by Alon Galor alongalor

Thank you, @mkai1, @sgosline and @jineta.banerjee for your assistance!
We very much appreciate it.
Very best,
Alon

Hello,
@sgosline I did not convert these files to fastq. I mostly worked with the VCF files for these samples released by the university core, and on a couple of occasions converted the bams to hdf5 counts files using GATK pipelines.
Thanks,
Jineta

Hi all,
@jineta.banerjee didn't you convert these to fastq at some point? Or am I thinking of a different dataset?
-sara

Hello @alongalor, I am looking into getting the original data files uploaded for you.
Kai

Thank you for your message, @allawayr.
@mkai1 it would be great to hear your thoughts, and whether you might be able to remedy this by uploading the original FASTQs, as Robert has suggested.
Thanks in advance for your help!
Alon
Hi Alon,
Thanks for following up. We can definitely ask the data contributors to look into this. Specifically, I'll tag @mkai1 and @Christine_Pratilas to point them to the [discussion thread](https://www.synapse.org/#!Synapse:syn4939902/discussion/threadId=7038&replyId=22491) to get their feedback.
@mkai1: can you comment on the issue that Alon is describing [here](https://www.synapse.org/#!Synapse:syn4939902/discussion/threadId=7038&replyId=22491)? A brief google search suggests that some prefiltering steps can cause a mismatch in pairs that will cause this error to be thrown by Picard. Would it be possible to instead provide the original fastqs for [these four exomeSeq bam files](https://www.synapse.org/#!Synapse:syn13363852/tables/query/eyJzcWwiOiJTRUxFQ1QgKiBGUk9NIHN5bjEzMzYzODUyIFdIRVJFICggKCBcImFzc2F5XCIgPSAnZXhvbWVTZXEnIE9SIFwiYXNzYXlcIiA9ICdybmFTZXEnICkgQU5EICggXCJpc0NlbGxMaW5lXCIgPSAnZmFsc2UnICkgQU5EICggXCJmaWxlRm9ybWF0XCIgPSAnYmFtJyApIEFORCAoIFwidHVtb3JUeXBlXCIgPSAnTWFsaWduYW50IFBlcmlwaGVyYWwgTmVydmUgU2hlYXRoIFR1bW9yJyApICkgQU5EIChcInRyYW5zcGxhbnRhdGlvblR5cGVcIiBpcyBudWxsKSBBTkQgYWNjZXNzVHlwZSBpbiAoJ1BVQkxJQycsJ1JFUVVFU1QgQUNDRVNTJykiLCAiaW5jbHVkZUVudGl0eUV0YWciOnRydWUsICJpc0NvbnNpc3RlbnQiOnRydWUsICJvZmZzZXQiOjAsICJsaW1pdCI6MjV9)?
(as an aside, if this data can be provided here, would it be possible to provide fastqs instead for the exomeSeq data moving forward as well?)
Thanks and have a great weekend all,
Robert
Hi @allawayr,
Thanks a lot for your help and for providing this very useful query! Your help is very much appreciated!
I am writing in hopes that you might be able to help with an additional issue:
I was able to download the data provided in this query. However, I experienced an issue re-processing the 4 BAM files present in the query.
The error I encounter can be reproduced as follows:
1. Convert the 4 BAMs to FASTQ files as follows (command shown for one representative BAM file; paths shortened for brevity):
```
samtools fastq -1 136810-1163242037_R1.fastq.gz -2 136810-1163242037_R2.fastq.gz 136810-1163242037.bam
```
This command runs without issue. The standard output is as follows:
```
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 44073468 reads
```
2. Next, attempt to convert the FASTQs to uBAMs as follows (command shown for one representative pair of FASTQ files; paths shortened for brevity):
```
java -Xmx8G -Djava.io.tmpdir=tmp -jar picard.jar FastqToSam \
FASTQ=136810-1163242037_R1.fastq.gz \
FASTQ2=136810-1163242037_R2.fastq.gz \
OUTPUT=136810-1163242037.bam \
READ_GROUP_NAME=136810-1163242037_RG \
SAMPLE_NAME=136810-1163242037 \
LIBRARY_NAME=lib_name \
PLATFORM=illumina \
TMP_DIR=tmp
```
These commands error out with the following written to standard output (full paths shown):
```
13:00:53.275 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/n/data1/hms/dbmi/park/alon/software/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri May 15 13:00:53 EDT 2020] FastqToSam FASTQ=/n/data1/hms/dbmi/park/DATA/BCH-GeM-New/Christine_Pratilas_2018_2021/WES/.SamtoolsFastq/136810-1163242037_R1.fastq.gz FASTQ2=/n/data1/hms/dbmi/park/DATA/BCH-GeM-New/Christine_Pratilas_2018_2021/WES/.SamtoolsFastq/136810-1163242037_R2.fastq.gz OUTPUT=/n/data1/hms/dbmi/park/DATA/BCH-GeM-New/Christine_Pratilas_2018_2021/WES/FastqToSam/136810-1163242037.bam READ_GROUP_NAME=136810-1163242037_RG SAMPLE_NAME=136810-1163242037 LIBRARY_NAME=lib_name PLATFORM=illumina TMP_DIR=[/n/data1/hms/dbmi/park/DATA/BCH-GeM-New/Christine_Pratilas_2018_2021/WES/FastqToSam/.sh/tmp] USE_SEQUENTIAL_FASTQS=false SORT_ORDER=queryname MIN_Q=0 MAX_Q=93 STRIP_UNPAIRED_MATE_NUMBER=false ALLOW_AND_IGNORE_EMPTY_LINES=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Fri May 15 13:00:53 EDT 2020] Executing as ag457@compute-p-17-38.o2.rc.hms.harvard.edu on Linux 3.10.0-327.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.3-SNAPSHOT
INFO 2020-05-15 13:00:53 FastqToSam Auto-detected quality format as: Standard.
[Fri May 15 13:00:53 EDT 2020] picard.sam.FastqToSam done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2058354688
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" picard.PicardException: In paired mode, read name 1 (180918_CIDR4_0409_ACCC52ANXX:1:1215:16106:64840) does not match read name 2 (180918_CIDR4_0409_ACCC52ANXX:1:2313:11655:80978)
at picard.sam.FastqToSam.getBaseName(FastqToSam.java:511)
at picard.sam.FastqToSam.doPaired(FastqToSam.java:403)
at picard.sam.FastqToSam.makeItSo(FastqToSam.java:374)
at picard.sam.FastqToSam.doWork(FastqToSam.java:347)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
```
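The Picard exception above is raised when the n-th record of the R1 file and the n-th record of the R2 file carry different read names. A minimal sketch of that check in Python follows, so the first out-of-sync record can be located before running `FastqToSam`. The function names are hypothetical and the demo records are synthetic placeholders, not data from these samples. One possible cause worth ruling out (an assumption, not confirmed for these files) is that the BAM was not grouped by read name before `samtools fastq`; the samtools documentation recommends `samtools collate` or `samtools sort -n` first when writing paired output.

```python
import gzip
import io
from itertools import zip_longest


def read_names(handle):
    """Yield the read name of each 4-line FASTQ record from a text handle."""
    record_line = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if record_line % 4 == 0:
            # Drop the leading '@' and anything after the first whitespace.
            name = (line[1:].split() or [""])[0]
            # Drop a trailing /1 or /2 mate suffix if present.
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name
        record_line += 1


def first_mismatch(r1_handle, r2_handle):
    """Return (record_index, name1, name2) for the first out-of-sync pair, else None."""
    pairs = zip_longest(read_names(r1_handle), read_names(r2_handle))
    for index, (name1, name2) in enumerate(pairs, start=1):
        if name1 != name2:
            return index, name1, name2
    return None


def check_files(r1_path, r2_path):
    """Run the check on a pair of gzipped FASTQ files."""
    with gzip.open(r1_path, "rt") as r1, gzip.open(r2_path, "rt") as r2:
        return first_mismatch(r1, r2)


# Synthetic demo: the second record is out of sync.
r1 = io.StringIO("@readA/1\nACGT\n+\nIIII\n@readB/1\nACGT\n+\nIIII\n")
r2 = io.StringIO("@readA/2\nTTTT\n+\nIIII\n@readC/2\nTTTT\n+\nIIII\n")
result = first_mismatch(r1, r2)
print(result)
```

If `check_files` returns `None` for a pair of files, the read names are in sync and the files have the same number of records; otherwise the returned index shows where the two files first diverge.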
I have processed tens of thousands of FASTQ/BAM files using the standard GATK Best Practices workflow, which the majority of the community uses, over a period of more than two years with only a single issue. Coincidentally, this single issue was created by exactly the same error as encountered here.
It would be great if a more readily usable format could be uploaded, so that our group and others around the world can analyze the data without issue.
Thanks for your help and very best,
Alon

Hi Alon,
Thank you for your patience! I fixed the specimen id issue last week but noticed another error that was causing misleading counts of xenografts and not-xenografts (i.e. some files were missing this boolean) in the summary table you were looking at. I've fixed these issues now. There are currently 4 public specimens that fit your criteria - MPNST, not cell line, not xenograft. For these 4 there should be both gene expression and genomic variant data. To see all of the files, please use this query:
```
SELECT * FROM syn13363852 WHERE ( ( "assay" = 'exomeSeq' OR "assay" = 'rnaSeq' ) AND ( "isCellLine" = 'false' ) AND ( "fileFormat" = 'cram' OR "fileFormat" = 'fastq' OR "fileFormat" = 'bam' ) AND ( "tumorType" = 'Malignant Peripheral Nerve Sheath Tumor' ) ) AND ("transplantationType" is null) AND accessType in ('PUBLIC','REQUEST ACCESS')
```
Here's a direct link to the query: https://www.synapse.org/#!Synapse:syn13363852/tables/query/eyJzcWwiOiJTRUxFQ1QgKiBGUk9NIHN5bjEzMzYzODUyIFdIRVJFICggKCBcImFzc2F5XCIgPSAnZXhvbWVTZXEnIE9SIFwiYXNzYXlcIiA9ICdybmFTZXEnICkgQU5EICggXCJpc0NlbGxMaW5lXCIgPSAnZmFsc2UnICkgQU5EICggXCJmaWxlRm9ybWF0XCIgPSAnY3JhbScgT1IgXCJmaWxlRm9ybWF0XCIgPSAnZmFzdHEnIE9SIFwiZmlsZUZvcm1hdFwiID0gJ2JhbScgKSBBTkQgKCBcInR1bW9yVHlwZVwiID0gJ01hbGlnbmFudCBQZXJpcGhlcmFsIE5lcnZlIFNoZWF0aCBUdW1vcicgKSApIEFORCAoXCJ0cmFuc3BsYW50YXRpb25UeXBlXCIgaXMgbnVsbCkgQU5EIGFjY2Vzc1R5cGUgaW4gKCdQVUJMSUMnLCdSRVFVRVNUIEFDQ0VTUycpIiwgImluY2x1ZGVFbnRpdHlFdGFnIjp0cnVlLCAiaXNDb25zaXN0ZW50Ijp0cnVlLCAib2Zmc2V0IjowLCAibGltaXQiOjI1fQ==
Or, install our command line, Python or R client to download the file: https://docs.synapse.org/articles/getting_started_clients.html
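For readers who want to adapt the query, the long token at the end of the direct link appears to be base64-encoded JSON wrapping the SQL string together with paging fields (`includeEntityEtag`, `isConsistent`, `offset`, `limit`). The sketch below round-trips a query through that layout in Python. The helper names are hypothetical, and whether the Synapse website accepts a token built this way is an assumption that has not been verified here.

```python
import base64
import json


def encode_query(sql, limit=25, offset=0):
    """Wrap an SQL string in the JSON layout seen in the link and base64-encode it."""
    payload = {
        "sql": sql,
        "includeEntityEtag": True,
        "isConsistent": True,
        "offset": offset,
        "limit": limit,
    }
    raw = json.dumps(payload).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_query(token):
    """Decode a token back into a dict so the embedded SQL can be inspected."""
    return json.loads(base64.b64decode(token))


# Shortened query for illustration; the table ID is the one used in this thread.
sql = "SELECT * FROM syn13363852 WHERE \"assay\" = 'rnaSeq'"
token = encode_query(sql)
decoded = decode_query(token)
print(decoded["sql"])
```

Decoding the token from a shared link with `decode_query` is a quick way to see exactly which filters it applies. With the Python client installed, the same SQL string can be passed to the client's `tableQuery` method rather than going through the website.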
Please do not hesitate to ask if you have additional questions or identify additional confusing metadata.
Thanks!
Cheers,
Robert
Hi @allawayr,
Hope you had a nice weekend!
Thanks again for all your help last week! Just wanted to check in on these remaining items and see if there is anything else I can provide on my end?
Very much appreciated,
Alon

Hi @allawayr,
Thanks for your reply! And no worries - I appreciate you being very accommodating and your efforts to make modifications!
>Here's the correct link: https://nf.synapse.org/Explore/Studies/DetailsPage?studyId=syn4939902
Thanks. This definitely looks very visually appealing :)
>I did not read closely enough. I was only considering RNASeq data. The Exomeseq component of this release, including bams, is here: https://www.synapse.org/#!Synapse:syn19226800
Ah ok! Thank you.
>Once I've fixed the previously mentioned metadata, I'll send you a fileview query that should contain all of the public RNASeq and ExomeSeq bam,cram,and fastqs from MPNSTs that are not from xenografts.
That would be a huge time saver! It would be amazing if I could quickly add all those files to the download list as well. Thanks a lot!
>Interesting. I am aware of that, but the majority of the rnaseq data we receive is concatenated into 2 files, even if it is run on many lanes.
Great! That is the way it should be!
>Your point is well taken though, maybe we can figure out a way to better convey this information.
Fantastic.
Have a great weekend,
Alon

Hi Alon,
No problem; apologies that it was necessary to reach out in the first place!
>... https://nf.synapse.org/Explore/Studies/DetailsPage?studyId=syn4939902I
>No worries, but this does not load for me.
Oops, somehow added an "l" to the end of the url....I'll blame my new keyboard :)
Here's the correct link: https://nf.synapse.org/Explore/Studies/DetailsPage?studyId=syn4939902
>Ok. If this folder has “[All] the data that are available”, is there a reason it does not contain the BAMs yielded by the following operation I previously described (shown immediately below)?
>In this case, applying the following filters to https://www.synapse.org/#!Synapse:syn13363852/: assay=exomeSeq, isCellLine=False, tumorType=Malignant Peripheral Nerve Sheath Tumor, fileFormat=bam, cram, and discarding rows that appear to have an association with a xenograft, yields 6 unique specimenIDs, and 7 samples -- consistent with the "Current Data" table.
I did not read closely enough. I was only considering RNASeq data. The Exomeseq component of this release, including bams, is here: https://www.synapse.org/#!Synapse:syn19226800
Once I've fixed the previously mentioned metadata, I'll send you a fileview query that should contain all of the public RNASeq and ExomeSeq bam,cram,and fastqs from MPNSTs that are not from xenografts.
>Ok. Thank you for clarifying - this is important to know and not obvious as often a single sample will be comprised of more than 2 FASTQ files, especially if the sample was sequenced across multiple lanes.
Interesting. I am aware of that, but the majority of the rnaseq data we receive is concatenated into 2 files, even if it is run on many lanes. There may be a couple of exceptions across the NF datasets we house, but we generally try to provide the data as R1 and R2 paired files (as most pipelines require that). Your point is well taken though, maybe we can figure out a way to better convey this information.
Alon

Hi @allawayr,
Thanks a lot for your swift response and for carefully addressing each of my concerns!
>You're correct, there should be at least as many samples as patients. Looking at the metadata, it appears that some of the data were deposited with sample ids as patient IDs, which is causing the error in counts.
I can try and correct this by the end of the day PT today, but probably not much sooner. Apologies for the error!
Thanks! Looking forward to hearing from you.
>I do also want to note that we are spending considerable effort developing the NF Data Portal to better standardize these views across all studies, and to make finding and access data more easy. You can view the current portal page for this study here: https://nf.synapse.org/Explore/Studies/DetailsPage?studyId=syn4939902I
No worries, but this does not load for me.
>also want to point you to a sneak peek (i.e. alpha mode) explore tool our portal team has been working on that may allow you to better search this and other data across the NF-OSI studies. For example, if you want to see all MPNST fastqs (both public and embargoed), you can find them by clicking here. We're still developing this feature, so if you have any feedback please don't hesitate to let me know!
Thanks for sharing.
>The data that are available are described in this preprint (https://www.biorxiv.org/content/10.1101/2019.12.19.871897v1.full.pdf). The data described in that preprint can be found here: https://www.synapse.org/#!Synapse:syn19522967 - this folder contains both fastqs and Salmon quantification files. I suspect you are seeing bams and fastqs that are not yet released.
Ok. If this folder has “[All] the data that are available”, is there a reason it does not contain the BAMs yielded by the following operation I previously described (shown immediately below)?
>In this case, applying the following filters to https://www.synapse.org/#!Synapse:syn13363852/: assay=exomeSeq, isCellLine=False, tumorType=Malignant Peripheral Nerve Sheath Tumor, fileFormat=bam, cram, and discarding rows that appear to have an association with a xenograft, yields 6 unique specimenIDs, and 7 samples -- consistent with the "Current Data" table.
>Thanks! We'll fix this.
Thank you.
>Each pair is a separate sample. There should be more samples than this, but I think this will be resolved after I make the fix described in my previous response.
Ok. Thank you for clarifying - this is important to know and not obvious as often a single sample will be comprised of more than 2 FASTQ files, especially if the sample was sequenced across multiple lanes.
Looking forward to hearing back soon and receiving further clarification on all this soon so that our group can begin re-processing this data.
Thank you,
Alon

>Starting with RNA-seq data, I navigate to https://www.synapse.org/#!Synapse:syn13363852/ (Regarding this, could you confirm this is the broadest view of the data? I ask because of an inconsistency I mention in my first post in this thread, i.e. that you link to a subset of this data in step (6) of Data Access), and filter as follows: assay=rnaSeq, isCellLine=False, tumorType=Malignant Peripheral Nerve Sheath Tumor.
Yes, this is the broadest view of the data. Please note, however, that it includes metadata for both published and embargoed data, so you will see rows of data that you may not be able to actually retrieve. The point of this is to allow people to get a better idea of all of the data that exists and should be available in the future.
>At this point, I would like to exclude Xenograft samples. Since I am interested in re-processing your data, file formats of interest for me are bam and fastq, so I select fileFormat=fastq, bam. I now download this spreadsheet, which contains 156 rows. I sort on transplantationDonorSpecies and remove entries with transplantationDonorSpecies=Mouse. I see this is also possible to do with an advanced search rather than downloading the file. Finally, sorting on specimenID, I notice there are 16 entries per specimenID - all of them fastqs. In other words, there are 8 pairs of fastqs per specimenID, of which there are 6. I.e. 48 pairs of fastqs.
>My question is:
>Does each of these pairs of fastqs constitute a separate sample, or does each set of 8 pairs of fastq files available for each specimenID constitute 1 sample? In other words, to confirm, do you have a total of 6 samples, for 6 specimenIDs?
Each pair is a separate sample. There should be more samples than this, but I think this will be resolved after I make the fix described in my previous response.
>How does this reconcile with the information you provide in this table: https://www.synapse.org/#!Synapse:syn4939902/wiki/593716, in particular, the following line: Malignant Peripheral Nerve Sheath Tumor false geneExpression 16 6, indicates 6 "samples" and 16 “patients”.
As noted in my previous response, some of the data appear to have been deposited with specimenIds as patientIds, so I will correct this. Let's revisit this then.
>A final note that this type of inconsistency between the Current Data table in https://www.synapse.org/#!Synapse:syn4939902/wiki/593716 and data present in https://www.synapse.org/#!Synapse:syn13363852/ does not appear to be present for exome data. In this case, applying the following filters to https://www.synapse.org/#!Synapse:syn13363852/: assay=exomeSeq, isCellLine=False, tumorType=Malignant Peripheral Nerve Sheath Tumor, fileFormat=bam, cram, and discarding rows that appear to have an association with a xenograft, yields 6 unique specimenIDs, and 7 samples -- consistent with the "Current Data" table.
Thanks for noting! I suppose the sample/patient metadata is correct for this, but will double check it regardless when I look at the expression data.
Hi Alon,
Thanks very much for reaching out! I hope that you are also well.
>First, I was hoping you could clarify the following issue with the metadata. Looking at this table: https://www.synapse.org/#!Synapse:syn4939902/wiki/593716, in particular, the following line: Malignant Peripheral Nerve Sheath Tumor false geneExpression 16 6, indicates 6 "samples" and 16 "patients" -- are there generally not either a greater or equal number of samples with respect to the number of patients when dealing with human subjects? Would you be able to provide some insight on this issue?
You're correct, there should be at least as many samples as patients. Looking at the metadata, it appears that some of the data were deposited with sample ids as patient IDs, which is causing the error in counts.
I can try and correct this by the end of the day PT today, but probably not much sooner. Apologies for the error!
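As a rough illustration of how this kind of count error can be spotted in exported metadata, here is a small Python sketch that counts distinct patient and specimen IDs and flags the case where patients outnumber samples. The column names `individualID` and `specimenID` are taken from the views discussed in this thread; the function name and the rows are synthetic placeholders, not the actual metadata.

```python
def count_ids(rows, patient_col="individualID", specimen_col="specimenID"):
    """Return (number of distinct patients, number of distinct specimens)."""
    patients = {r[patient_col] for r in rows if r.get(patient_col)}
    specimens = {r[specimen_col] for r in rows if r.get(specimen_col)}
    return len(patients), len(specimens)


# Synthetic rows: two files from specimen S1 were given file-level IDs in the
# patient column, which inflates the patient count.
rows = [
    {"individualID": "S1_rep1", "specimenID": "S1"},
    {"individualID": "S1_rep2", "specimenID": "S1"},
    {"individualID": "P2", "specimenID": "S2"},
]

n_patients, n_specimens = count_ids(rows)
needs_review = n_patients > n_specimens
print(n_patients, n_specimens, needs_review)
```

A summary line with more patients than samples, such as the "16 6" row in the table, would be flagged by this check.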
>Navigating to the "file view" mentioned in https://www.synapse.org/#!Synapse:syn4939902/wiki/593715, I only see 493 RNA-seq data files. However, navigating directly to https://www.synapse.org/#!Synapse:syn13363852/, there are 667 RNA-seq data files. Would you be able to clarify why you have chosen this view? I personally found this confusing.
This view predates my involvement in this project but my suspicion is that it was to provide a summary of the available data rather than a direct link to the data. Thank you for the feedback, though, we can tweak the presentation to make it clearer.
I do also want to note that we are spending considerable effort developing the NF Data Portal to better standardize these views across all studies, and to make finding and access data more easy. You can view the current portal page for this study here: https://nf.synapse.org/Explore/Studies/DetailsPage?studyId=syn4939902I also want to point you to a sneak peek (i.e. alpha mode) explore tool our portal team has been working on that may allow you to better search this and other data across the NF-OSI studies. For example, if you want to see all MPNST fastqs (both public and embargoed), you can find them by clicking [here](https://staging.nf.synapse.org/Explore/Files?QueryWrapper0=%7B%22sql%22%3A%22SELECT%20id%20AS%20%5C%22File%20ID%5C%22%2C%20assay%2C%20dataType%2C%20diagnosis%2C%20tumorType%2C%20%20species%2C%20individualID%2C%20%20fileFormat%2C%20dataSubtype%2C%20nf1Genotype%20as%20%5C%22NF1%20Genotype%5C%22%2C%20nf2Genotype%20as%20%5C%22NF2%20Genotype%5C%22%2C%20studyName%2C%20fundingAgency%2C%20consortium%2C%20name%20AS%20%5C%22File%20Name%5C%22%2C%20accessType%2C%20accessTeam%20%20FROM%20syn16858331%20WHERE%20resourceType%20%3D%20%27experimentalData%27%22%2C%22limit%22%3A25%2C%22offset%22%3A0%2C%22selectedFacets%22%3A%5B%7B%22concreteType%22%3A%22org.sagebionetworks.repo.model.table.FacetColumnValuesRequest%22%2C%22columnName%22%3A%22assay%22%2C%22facetValues%22%3A%5B%22rnaSeq%22%5D%7D%2C%7B%22concreteType%22%3A%22org.sagebionetworks.repo.model.table.FacetColumnValuesRequest%22%2C%22columnName%22%3A%22fileFormat%22%2C%22facetValues%22%3A%5B%22fastq%22%5D%7D%2C%7B%22concreteType%22%3A%22org.sagebionetworks.repo.model.table.FacetColumnValuesRequest%22%2C%22columnName%22%3A%22tumorType%22%2C%22facetValues%22%3A%5B%22Malignant%20Peripheral%20Nerve%20Sheath%20Tumor%22%5D%7D%5D%7D). We're still developing this feature, so if you have any feedback please don't hesitate to let me know!
>When attempting to download raw RNAseq data, when I clicked "Add to Download List", under "Download Options" in https://www.synapse.org/#!Synapse:syn13363874, and proceed to attempt to download the files by clicking on "Click to view items in your download list", I encountered a notice that reads "You must request access to this restricted file", under "Access". Is there a reason I am not able to access these files via this view, but am presumably able to download them via an alternative view?
This is a living repository, so some of the samples are not yet available even to those who have been granted access because they have not yet been published/finalized by the data contributor. The data that are available are described in this preprint (https://www.biorxiv.org/content/10.1101/2019.12.19.871897v1.full.pdf). The data described in that preprint can be found here: https://www.synapse.org/#!Synapse:syn19522967 - this folder contains both fastqs and Salmon quantification files. I suspect you are seeing bams and fastqs that are not yet released.
>Finally, please note that the link associated with the link in this statement: "To view the data files, please navigate to the file view that lists all the files and their associated metadata." redirects to the synapse home page.
Thanks! We'll fix this.
Following up on my previous post, I am now attempting to download the 7 samples mentioned in my "A final note" section above - however, when I navigate to my download list, I see that only 4/7 are available for download due to permission settings. Would you be able to grant me permission to download these files?
As another follow-up: in attempting to download the 96 RNA-seq files I mention earlier in this thread, only 59/96 are actually added to the download list. It appears that some of the data files are not available - their "dataFileHandleId" reads "Unable to load file data: undefined". Of these 59, I have download access to 39. Would you please be able to provide some clarity regarding this?
Thanks a lot,
Alon
Thank you,
Alon

I was hoping you might be able to clarify another issue!
**First some background:**
As stated earlier, I am most interested in downloading and re-processing your MPNST human primary-tissue, non-xenograft RNA-seq and exome data.
Starting with RNA-seq data, I navigate to https://www.synapse.org/#!Synapse:syn13363852/ (Regarding this, could you confirm this is the broadest view of the data? I ask because of an inconsistency I mention in my first post in this thread, i.e. that you link to a subset of this data in step (6) of Data Access), and filter as follows: assay=rnaSeq, isCellLine=False, tumorType=Malignant Peripheral Nerve Sheath Tumor.
At this point, I would like to exclude Xenograft samples. Since I am interested in re-processing your data, file formats of interest for me are bam and fastq, so I select fileFormat=fastq, bam. I now download this spreadsheet, which contains 156 rows. I sort on transplantationDonorSpecies and remove entries with transplantationDonorSpecies=Mouse. I see this is also possible to do with an advanced search rather than downloading the file.
Finally, sorting on specimenID, I notice there are 16 entries per specimenID - all of them fastqs. In other words, there are 8 pairs of fastqs per specimenID, of which there are 6. I.e. 48 pairs of fastqs.
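The filtering and counting steps described above can be sketched in Python as follows. The column names (`specimenID`, `fileFormat`, `transplantationDonorSpecies`) come from the file view; the function name is hypothetical and the rows are synthetic placeholders built only to reproduce the counts reported in this post (156 rows, 16 fastqs for each of 6 specimenIDs). Treating every removed row as a mouse xenograft is an assumption of the demo.

```python
from collections import Counter

WANTED_FORMATS = {"fastq", "bam"}


def files_per_specimen(rows):
    """Count fastq/bam files per specimenID after dropping mouse-xenograft rows."""
    kept = [
        r for r in rows
        if r.get("fileFormat") in WANTED_FORMATS
        and r.get("transplantationDonorSpecies") != "Mouse"
    ]
    return Counter(r["specimenID"] for r in kept)


# Synthetic stand-in for the downloaded spreadsheet: 6 specimens x 16 fastqs,
# plus 60 placeholder xenograft rows, for 156 rows in total.
rows = [
    {"specimenID": f"specimen_{s}", "fileFormat": "fastq",
     "transplantationDonorSpecies": None}
    for s in range(1, 7) for _ in range(16)
]
rows += [
    {"specimenID": "xenograft_placeholder", "fileFormat": "fastq",
     "transplantationDonorSpecies": "Mouse"}
    for _ in range(60)
]

counts = files_per_specimen(rows)
pairs = {sid: n // 2 for sid, n in counts.items()}  # two fastqs per R1/R2 pair
total_pairs = sum(pairs.values())
print(len(rows), len(counts), total_pairs)  # 156 6 48
```

Running the same function on the real export would show directly whether every specimenID has the same number of files, and how many R1/R2 pairs that implies.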
**My question is:**
Does each of these pairs of fastqs constitute a separate sample, or does each set of 8 pairs of fastq files available for each specimenID constitute 1 sample? In other words, to confirm, do you have a total of 6 samples, for 6 specimenIDs? How does this reconcile with the information you provide in this table: https://www.synapse.org/#!Synapse:syn4939902/wiki/593716, in particular the following line: Malignant Peripheral Nerve Sheath Tumor false geneExpression 16 6, which indicates 6 "samples" and 16 “patients”?
**A final note:**
This type of inconsistency between the Current Data table in https://www.synapse.org/#!Synapse:syn4939902/wiki/593716 and data present in https://www.synapse.org/#!Synapse:syn13363852/ does not appear to be present for exome data. In this case, applying the following filters to https://www.synapse.org/#!Synapse:syn13363852/: assay=exomeSeq, isCellLine=False, tumorType=Malignant Peripheral Nerve Sheath Tumor, fileFormat=bam, cram, and discarding rows that appear to have an association with a xenograft, yields 6 unique specimenIDs, and 7 samples -- consistent with the "Current Data" table.
Thanks for your help,
Alon
@sgosline @allawayr hope you both are well! Tagging you both here in case you were not notified of my post. Thanks in advance for your hard work and help!
Very best,
Alon