Hi,
I've been trying to map the reads in the raw MSBB fastq files to the hg38 genome, rather than converting the mapped bam files back to fastq. The problem is that the fastq files are generally 20-40 times smaller than their paired bam files (~10 times smaller after decompression), as you can see on the download page (https://www.synapse.org/#!Synapse:syn7416949). I then looked at the FastQC results for the fastq files and compared the total number of reads shown there with the number reported in the metadata file (MSBB_RNAseq_covariates.csv). Here are the distributions of the number of reads reported by FastQC and by the metadata file:
**FastQC**
![Number of reads reported by FastQC](https://i.imgur.com/kZ9hsb2.png)
**Meta file**
![Number of reads reported in meta file](https://i.imgur.com/6coB651.png)
As you can see, the average number of reads reported by FastQC is less than 1 million, while the average reported in the metadata file is around 40 million. Does this mean that the unmapped fastq files are somehow incomplete or damaged? If so, would you please upload complete fastq files as replacements?
Thanks,
Yikai
Created by Yikai Luo (yikai1014)

Hi Dr. Peters,
Thanks a lot for the clarification! I will download the bam files remapped to hg38 instead.
Yikai

Hi @yikai1014 - the fastq files in syn7416949 contain only the unmapped reads that are not included in the aligned bam files. They were provided so that users can recreate the full fastq. You may want to use the data from the RNAseq reprocessing project - syn9702085. This project is an AMP-AD consortium collaboration in which bam files from 3 large RNAseq studies (including MSBB) were converted back to fastq and aligned to hg38. Here is a description of what was done and the data files - syn17010685
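
For anyone who wants to sanity-check this on their own downloads, here is a minimal Python sketch that counts the reads in a fastq file and expresses them as a fraction of a total read count. It is not part of any MSBB or AMP-AD pipeline. The function names are made up for illustration, it assumes standard 4-line fastq records (no wrapped sequence lines), and it assumes the count in the metadata file is the total number of sequenced reads for the sample, which this thread does not confirm.

```python
import gzip
from pathlib import Path


def count_fastq_reads(path):
    """Count records in a FASTQ file (plain text or .gz).

    Assumes the standard layout of exactly 4 lines per read.
    """
    path = Path(path)
    # Pick a gzip-aware opener for .gz files, the builtin open otherwise.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def unmapped_fraction(unmapped_reads, total_reads):
    """Fraction of all reads that ended up in the unmapped fastq.

    Assumption: total_reads is the full read count for the sample
    (for example, the value taken from the metadata file).
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return unmapped_reads / total_reads
```

With the rough figures quoted in this thread (under 1 million reads in the unmapped fastq versus around 40 million in the metadata file), `unmapped_fraction(1_000_000, 40_000_000)` gives 0.025, so the unmapped fastq would hold at most a few percent of the reads. That is what you would expect if the rest are in the aligned bam.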
MSBB unmapped fastq files (syn7416949) have way lower number of reads than reported in meta file