Hi,
I have noticed that under the same barcode there are some resequenced bam files, i.e., hB_RNA_9212.accepted_hits.sort.coord.bam ([syn5519241](syn5519241)) and hB_RNA_9212_resequenced.accepted_hits.sort.coord.bam ([syn5853553](syn5853553)).
I would like to know how the MSBB team deals with these resequenced data in their subsequent analysis. Do they choose the higher-quality one or merge the two files together?
Thanks,
Wei Hong
Created by Wei Hong weihong1991
John,
The batch information is correct. The file naming is unfortunately confusing due to various rounds of transferring/relocation.
Minghui
Thank you @Mette for resolving this issue so quickly.
I might also suggest you rename the RNAseq_covariates file with a "\_December2018Update" suffix to reflect that it includes this fix. We were confused a bit on our end that this issue, first reported on 12/17/2018, appeared to have been fixed in a file with "\_November2018Update".
I had another question about a similar issue. We have historically paired the ".accepted_hits.sort.coord.bam" file with the ".unmapped.fastq.gz" file based on the initial part of the filename. By paired, I mean treated as if they came from the same sequencing run, when we have done analysis that required realigning from scratch.
However, with the inclusion of complete batch information (for all RNAseq runs) in the RNAseq_covariates file, we've used this information to pair the files. I found one exception of sorts.
Based on filename, these two go together:
hB_RNA_10892.accepted_hits.sort.coord.bam
hB_RNA_10892.unmapped.fastq.gz
And these two (maybe):
hB_RNA_10892_K77C014.accepted_hits.sort.coord.bam (just added recently by @karawoo)
hB_RNA_10892_resequenced.unmapped.fastq.gz
However, based on batch, the pairing appears to be different.
| Filename | Batch | synapseID |
|---|---|---|
| hB_RNA_10892.accepted_hits.sort.coord.bam | E007C014 | [syn5850147](syn5850147) |
| hB_RNA_10892.unmapped.fastq.gz | K77C014 | [syn5519316](syn5519316) |
| hB_RNA_10892_K77C014.accepted_hits.sort.coord.bam | K77C014 | [syn17013947](syn17013947) |
| hB_RNA_10892_resequenced.unmapped.fastq.gz | E007C014 | [syn5853375](syn5853375) |
Can you confirm the batch information is correct for these two sequencing runs? Might it be wise to rename the files if that is the case?
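For anyone who wants to reproduce this check, here is a minimal sketch of pairing the files by batch rather than by filename prefix. The rows are copied from the table above; the function names `pair_by_batch` and `prefix` are made up for illustration, and in practice the batch column would be read from the RNAseq_covariates file rather than hard-coded.

```python
from collections import defaultdict

# Rows copied from the table above: (filename, batch).
FILES = [
    ("hB_RNA_10892.accepted_hits.sort.coord.bam", "E007C014"),
    ("hB_RNA_10892.unmapped.fastq.gz", "K77C014"),
    ("hB_RNA_10892_K77C014.accepted_hits.sort.coord.bam", "K77C014"),
    ("hB_RNA_10892_resequenced.unmapped.fastq.gz", "E007C014"),
]

BAM_SUFFIX = ".accepted_hits.sort.coord.bam"
FASTQ_SUFFIX = ".unmapped.fastq.gz"


def pair_by_batch(files):
    """Group files by batch and return {batch: (bam, fastq)}.

    A missing BAM or FASTQ for a batch is reported as None.
    """
    by_batch = defaultdict(dict)
    for name, batch in files:
        if name.endswith(BAM_SUFFIX):
            by_batch[batch]["bam"] = name
        elif name.endswith(FASTQ_SUFFIX):
            by_batch[batch]["fastq"] = name
    return {b: (d.get("bam"), d.get("fastq")) for b, d in by_batch.items()}


def prefix(name):
    """Return the part of the filename before the known suffix."""
    for suffix in (BAM_SUFFIX, FASTQ_SUFFIX):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


pairs = pair_by_batch(FILES)
for batch, (bam, fastq) in sorted(pairs.items()):
    # Flag batches whose BAM and FASTQ would NOT be paired by filename alone.
    same_prefix = (
        bam is not None and fastq is not None and prefix(bam) == prefix(fastq)
    )
    status = "prefix match" if same_prefix else "prefix mismatch"
    print(batch, bam, fastq, status)
```

With the four rows above, both batches come out as a prefix mismatch, which is the discrepancy I am asking about.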
The new file naming, including the batch (everything uploaded recently by @karawoo), is a great improvement in general! @Mette @minghui.wang
Thank you very much!
@weihong1991 - please see a new version of the file in syn6100548. Thanks for pointing this out.
Hi Wei,
Thanks for the great question. In addition to RIN score and rRNA rate criteria, sequencing depth is another factor that is used to select the best sample from replicates. In this case, the one with more mapped reads will be selected.
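As a rough illustration of that rule (not the actual MSBB pipeline code), the selection could be sketched as follows. The thresholds RIN > 4 and rRNA rate < 5% are the ones quoted elsewhere in this thread; the `Replicate` class, the function names, and all the numbers in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replicate:
    name: str
    rin: float          # RNA integrity number
    rrna_rate: float    # fraction of rRNA reads, so 0.05 means 5%
    mapped_reads: int   # sequencing depth proxy


def passes_qc(rep: Replicate, min_rin: float = 4.0,
              max_rrna_rate: float = 0.05) -> bool:
    """Apply the RIN and rRNA-rate thresholds (assumed: RIN > 4, rRNA < 5%)."""
    return rep.rin > min_rin and rep.rrna_rate < max_rrna_rate


def select_best(replicates: List[Replicate]) -> Optional[Replicate]:
    """Among replicates passing QC, keep the one with the most mapped reads."""
    candidates = [r for r in replicates if passes_qc(r)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.mapped_reads)


# Hypothetical values for illustration only; not taken from the metadata file.
reps = [
    Replicate("sample_A", rin=6.1, rrna_rate=0.02, mapped_reads=30_000_000),
    Replicate("sample_A_resequenced", rin=6.3, rrna_rate=0.01,
              mapped_reads=42_000_000),
]
best = select_best(reps)
print(best.name)
```

Here both replicates pass QC, so the one with more mapped reads is kept and the other is dropped rather than merged.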
Regarding the mismatched batch id for hB_RNA_9212, it seems the current metadata file is not the right version. @Mette I think there is something wrong here. I will re-send you the updated version in case you can't find it.
Minghui
I will have to refer your question to the data contributor. @minghui.wang - can you please take a look at the question above? Thanks!
Thanks Mette,
I have checked the metadata file; both samples were marked as "Okay". In addition, both samples have RIN > 4 and rRNA rate < 5%, so they pass the QC thresholds described in [syn3157743](syn3157743).
However, there is only one "hB_RNA_9212" in the expression matrix.
Besides, some bam files and their corresponding unmapped fq files have different batch IDs. For example, according to [syn6100548](syn6100548), the batch ID of hB_RNA_9212.accepted_hits.sort.coord.bam ([syn5519241](syn5519241)) is E007C014, while that of hB_RNA_9212.unmapped.fq.gz ([syn5519720](syn5519720)) is E2C014. Since the two files came from the same raw sequencing data, I would expect them to have the same batch ID.
Thank you very much,
Wei Hong
I recommend first taking a look at the RNAseq metadata file. It contains information on which samples were used: syn6100548