Hi,
I'm having trouble finding information about how some of the data were processed (and which experiment they correspond to). I am after RNAseq raw read counts and am generally looking for .tsv files.
If I go to the MayoRNAseq Study data section, there is MayoRNAseq_RNAseq_TCX_geneCounts.tsv (syn4650257), MayoRNAseq_RNAseq_TCX_geneCounts_normalized.tsv (syn4650265), MayoRNAseq_RNAseq_CBE_geneCounts.tsv (syn5201012), and some others. When I open these files in Synapse, I can backtrack through the file paths to get some idea of what they correspond to. With this I was able to get to syn6126177, which lists all counts for cerebellum samples of MayoRNAseq.
In this particular case, there are 5 .tsv count files. What might be the difference between MayoRNAseq_RNAseq_CBE_transcriptCounts.tsv (syn5600773) and MayoRNAseq_RNAseq_CBE_geneCounts.tsv (syn5201012)? There is also MayoRNAseq_RNAseq_CBE_transcriptCounts_normalized.tsv (syn6126177), but I cannot find what normalisation this is supposed to be.
These files belong to the Gene Expression (RNAseq - SNAPR) section, whose methods description states:
"Explanation of available files and post-processing: The individual read count files produced by SNAPR are merged into a single file: "AMP-AD MayoRNAseq UFL-Mayo-ISB mRNA Alzheimers Disease IlluminaHiSeq2000 CBE geneExp raw count Homo sapiens" with combine_count_files.pl. These merged count files are normalized with the tmm_normalization.R script which uses the edgeR implementation of TMM to calculate CPM. Differences in library size were normalized across samples using the EdgeR function calcNormFactors. Normalized read counts were then converted to cpm with the cpm function, also in EdgeR. These normalized counts are saved as "AMP-AD MayoRNAseq UFL-Mayo-ISB mRNA Alzheimers Disease IlluminaHiSeq2000 CBE geneExp TMM normalized Homo sapiens"."
Which does not help me much.
I think there must be an easier way to find this out, but I'm just missing it.
Thanks,
T
Created by Tapio Nevalainen newsky
Thanks a lot!
T
Hello,
It looks like for the files that end in "transcriptCounts.tsv", reads were summarized/counted at the transcript level (where there may be multiple transcripts per gene), while the files that end in "geneCounts.tsv" were summarized at the gene level. Unless you need information about which transcripts within a gene are more/less abundant, the "geneCounts.tsv" files are probably what you want.
As for normalization, based on their description it sounds like they did something like the following in R (code not tested):
```
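library(edgeR)
# counts_matrix: raw read counts, genes in rows and samples in columns
y <- DGEList(counts = counts_matrix)
# Compute a TMM scaling factor for each sample's library size
y <- calcNormFactors(y, method = "TMM")
# Convert to CPM using the TMM-scaled (effective) library sizes
y <- cpm(y)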
```
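So the "normalized.tsv" counts files should contain counts per million (CPM) computed against each sample's effective library size, that is, the raw library size multiplied by the TMM scaling factor that edgeR calculated for that sample. You can check that my assumption is right by seeing if the column sums are all close to 1e6. They will not be exactly 1e6, because each column sums to 1e6 divided by that sample's scaling factor.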
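To make the column-sum check concrete, here is a small base-R sketch of the arithmetic. It does not call edgeR: the counts are made up, and `norm_factors` is a hypothetical stand-in for the factors that `calcNormFactors` would return, so treat it as an illustration of the idea rather than a reproduction of the Mayo pipeline.

```r
# Toy raw counts: 3 genes (rows) x 2 samples (columns). Values are made up.
counts <- matrix(c(10, 30, 60,
                   20, 20, 160),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

# Hypothetical TMM factors, standing in for the output of calcNormFactors.
norm_factors <- c(s1 = 0.8, s2 = 1.25)

lib_size <- colSums(counts)               # raw library size per sample
eff_lib_size <- lib_size * norm_factors   # effective library size per sample

# Divide each column by its effective library size, then scale to per-million.
norm_cpm <- sweep(counts, 2, eff_lib_size, "/") * 1e6

# Each column sums to 1e6 / norm_factor, so the sums are near 1e6
# only when the factors themselves are near 1.
col_sums <- colSums(norm_cpm)
print(col_sums)
print(1e6 / norm_factors)
```

On a real "normalized.tsv" file, the same `colSums` check would apply to the numeric sample columns, assuming samples are stored in columns. Sums that scatter around 1e6 would be consistent with TMM-adjusted CPM, while sums of exactly 1e6 would suggest plain CPM with no TMM adjustment.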
As for the names given in the methods writeup (like "AMP-AD MayoRNAseq UFL-Mayo-ISB mRNA Alzheimers Disease IlluminaHiSeq2000 CBE geneExp raw count Homo sapiens"), I think it's highly likely that when they put the data on Synapse they shortened the filenames to what we see now and forgot to update the description to match.
So in summary, I think what you want are the files ending in "geneCounts.tsv", which contain the raw read counts for the temporal cortex (TCX) and cerebellum (CBE). Or, if you want to use their TMM-normalized values, you would want the "geneCounts_normalized.tsv" files instead.
Hopefully that answers your questions, but if I missed something or you have more questions please reply here and I'll do my best to help.
Jaclyn