I have a question about the number of samples in the MayoRNAseq dataset. I hope you can help me with it.
When I search for "syn9738945" and apply the filters "rnaSeqReprocessing", "fastq", "MayoRNAseq" and "Temporal Cortex" in that order, I obtain 552 fastq files (276 samples).
According to Allen et al. (2016), this dataset should include 80 control and 84 Alzheimer's disease (AD) subjects. I downloaded all 552 fastq files and checked whether the numbers of AD and control samples match those given in the article. Among the downloaded files, I found 82 AD and 78 control samples, so it appears that 2 AD and 2 control samples were removed from the dataset.
Were those samples actually removed? Or am I missing something? If they were removed, then why?
I would be grateful if you could explain the difference in the number of samples, and please let me know if I am missing something.
Thank you so much for your time!
Created by Dilara
Hello @mxa24,
Thank you so much for your detailed explanation. It helped a lot.
Best wishes,
Hello @Dilara
The data were shared early in the research cycle, before QC and analysis were complete. The most up-to-date information can therefore be found in the study documentation on Synapse, which is updated as more information becomes available.
- The count of 552 fastq files should be correct, because the two TCX samples that failed the sex check were not included in the RNAseq reprocessing effort.
- The number of AD and controls can be determined by linking the biospecimen and individual metadata files and referencing the diagnosis, exclude and excludeReason fields.
- There were 3 individuals (1950, 1925 and 1957) who were later identified as no longer being controls after updates to the neuropathology information, so they were flagged in the exclude columns above and masked (NA) in the metadata files so that they would not be inadvertently included in analysis.
- RNAseq samples from these 3 individuals have an exclude reason of "(Pathology) - Does not meet control criteria (Braak > 3.0)".
- Two of the above three "control" individuals were part of the TCX dataset (1950_TCX and 1925_TCX), which is why the updated metadata has 78 controls and not 80.
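As a rough illustration of the linking step described above, here is a minimal pandas sketch. It uses small in-memory tables rather than the real metadata files. The join key `individualID`, the `specimenID` column, the boolean encoding of `exclude`, the diagnosis labels and the sample IDs `A` and `B` are all assumptions made for the example; only the `diagnosis`, `exclude` and `excludeReason` field names and the two excluded TCX specimens come from this thread.

```python
import pandas as pd

REASON = "(Pathology) - Does not meet control criteria (Braak > 3.0)"

# Toy stand-in for the biospecimen metadata file.
# specimenID / individualID column names are assumed, not confirmed.
biospecimen = pd.DataFrame({
    "specimenID": ["1950_TCX", "1925_TCX", "A_TCX", "B_TCX"],
    "individualID": ["1950", "1925", "A", "B"],  # A and B are hypothetical
    "exclude": [True, True, False, False],       # assumed boolean encoding
    "excludeReason": [REASON, REASON, None, None],
})

# Toy stand-in for the individual metadata file; the diagnosis of the
# excluded individuals is masked (NA), as described in the post.
individual = pd.DataFrame({
    "individualID": ["1950", "1925", "A", "B"],
    "diagnosis": [None, None, "control", "Alzheimer Disease"],
})


def count_by_diagnosis(biospecimen: pd.DataFrame, individual: pd.DataFrame) -> dict:
    """Join the two tables, drop excluded specimens, count per diagnosis."""
    merged = biospecimen.merge(individual, on="individualID", how="left")
    kept = merged[~merged["exclude"].astype(bool)]
    # value_counts() skips NA, so masked diagnoses are never counted.
    return kept["diagnosis"].value_counts().to_dict()


counts = count_by_diagnosis(biospecimen, individual)
print(counts)
```

With the real files, the same join restricted to TCX specimens would be expected to give 78 controls, provided the column names and encodings match these assumptions.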
I hope that clears things up for you!
Mariet
Hello @abby.vanderlinden and @mxa24,
Thank you so much for your help.
According to my calculations, two more subjects are still unaccounted for. According to Allen et al. (2016), 80 control and 84 Alzheimer's disease subjects should be included, but only 78 control and 82 AD subjects are present. So 2 control and 2 AD subjects are missing.
You said 2 of them were removed (132_TCX and 844_TCX). Is it possible that 2 more subjects are missing?
By the way, I checked the flagged samples: 29 samples were flagged. Of these, 23 are TCX samples in the range [1005-1123], and they are already included in the 552 fastq files. The remaining 6 belong to progressive supranuclear palsy samples.
Thank you so much for your time.
Hello @abby.vanderlinden and @Dilara
Some samples were flagged due to failing various quality control metrics - these are flags provided in the biospecimen metadata file (syn20827192).
There were 2 TCX samples that failed sex check and it was decided to remove the data for these from the study (132_TCX and 844_TCX).
There are 278 TCX samples, with 2 fastq files per sample = 556. Removing the 4 fastq files for the 2 samples failing sex-check results in 552 TCX fastq files.
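The arithmetic above can be checked with a short sketch. The variable names are only illustrative; the numbers and sample IDs are the ones given in this post.

```python
FILES_PER_SAMPLE = 2                      # 2 fastq files per sample
tcx_samples = 278                         # TCX samples before removal
failed_sex_check = ["132_TCX", "844_TCX"]  # samples removed from the study

total_files = tcx_samples * FILES_PER_SAMPLE
removed_files = len(failed_sex_check) * FILES_PER_SAMPLE
remaining_files = total_files - removed_files
remaining_samples = tcx_samples - len(failed_sex_check)

print(total_files, removed_files, remaining_files, remaining_samples)
# 556 4 552 276
```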
I hope that helps!
Mariet
Hi there, it does seem like there are only 552 fastq files from the Mayo rnaSeqReprocessing study samples (the underlying folder is here: https://www.synapse.org/#!Synapse:syn8612203).
I'm not sure why this doesn't match up with the numbers reported in the paper. Perhaps @mxa24 can help?