Hello, I am working with the harmonized RNAseq data ([syn30821562](syn30821562)) and was wondering if it is possible to obtain the residual CPM after regressing out age and sex. The available options currently are (i) age, (ii) diagnosis, (iii) diagnosis + sex, and (iv) diagnosis + sex + age. Although the sex information is available in the cohort metadata, since age is censored, I am unable to regress the combination of age+sex myself. Thanks in advance.

Created by Upamanyu Ghose ug96
Thank you very much for the detailed response! I have been working on a pipeline based on your guidance and I think it removes the strong batch effects observed in the data. I'm focussing on DLPFC for the time being and I'd be happy to share the script and sanity check plots within the next few days to see how it compares with the ongoing re-analysis of the data.
Hello, I am so sorry for losing track of this. There is some R notebook output for [Mayo](syn27024974), [MSBB](syn27068766), and [ROSMAP](syn26967461) in the differential expression folder of this project that goes over what variables they used. They used `sageseqr` to analyze the data, which uses `mvIC` to automatically determine what variables to use, then runs `variancePartition::dream`, then adds biological effects (diagnosis, age, sex) back to the residuals. However it looks like they regressed all the tissues together instead of individually, their outlier detection code doesn't work as intended, and the models included covariates that were highly related to each other, so I wouldn't recommend using their formulas or `sageseqr`.

Neither I nor my colleague have finished re-analyzing this data, and we are not using `sageseqr`, but I can give you some rough idea of what I've found so far:

* You might want to try `svaseq` and/or `ComBat_seq` to find surrogate variables and remove batch effects -- my data looks pretty good on a PCA plot after using `ComBat_seq` on each of the tissues, with "diagnosis" as the `group` and flowcell/sequencingBatch as the `batch`.

If instead you want to remove specific variables, some general observations about each data set are below:

* The Mayo tissues have almost no batch effects. You could use RIN and pick **one** of the `RnaSeqMetrics_` variables in their formula based on correlation with a PCA of the data.
* The MSBB tissues do have batch effects so you need batch in the model. You should also use sex and RIN, and pick one of the picard metrics, but look at the correlation of the metrics with a PCA from each tissue to decide which one.
* The ROSMAP tissues are difficult and I haven't finished looking at these yet. Basically -- 1/3 of the data was extracted/processed with one protocol and the other 2/3 with a different protocol, and a small portion of the 1/3 data was sequenced with the 2/3 data, so there are pretty severe batch effects. (See the [wiki describing their methods](https://www.synapse.org/Synapse:syn3388564)). You will likely need sequencingBatch, libraryPrep, sex, and RIN in the model plus at least one of the picard metrics for each tissue. When you plot a PCA of the ACC tissue, you can see that there is some technical effect on the data that isn't accounted for by any of the metrics and I haven't been able to figure out what.
* Picard metrics can be found in each dataset's folder in the [raw data](https://www.synapse.org/Synapse:syn26720675) folder of this project.

I'm sorry I don't have more specific advice, this is an on-going re-analysis, but I hope that was at least a little helpful.

Jaclyn Beck
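As a rough illustration of the "pick one metric based on correlation with a PCA" advice above, here is a minimal Python sketch. The tooling discussed in this thread is R, so this only shows the logic and is not anyone's actual pipeline. It assumes you have already computed per-sample scores for one principal component of a single tissue; the metric names and all numbers are made up for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pick_metric(pc_scores, metrics):
    """Return (name, r) for the metric with the largest |r| against pc_scores.

    pc_scores: per-sample scores on one principal component.
    metrics:   dict mapping metric name -> per-sample values (same sample order).
    """
    correlations = {name: pearson(values, pc_scores) for name, values in metrics.items()}
    best = max(correlations, key=lambda name: abs(correlations[name]))
    return best, correlations[best]


# Hypothetical data: five samples, two candidate metrics (names are placeholders).
pc1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
candidate_metrics = {
    "RnaSeqMetrics_example_a": [0.30, 0.28, 0.25, 0.22, 0.20],
    "RnaSeqMetrics_example_b": [0.5, 0.7, 0.4, 0.6, 0.5],
}

best_name, best_r = pick_metric(pc1, candidate_metrics)
print(best_name, round(best_r, 3))
```

In practice you would repeat this per tissue and look at more than one component before settling on a single metric, since several of the metrics tend to be highly correlated with each other.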
Hello, I was wondering if it's possible to get some help with my question. Thanks in advance. Best, Upamanyu
Hello Jaclyn, Thanks very much for your response. The information is really helpful. If I were to do my own analysis, I am guessing that I would also want to regress out technical covariates specific to RNA-seq. I could not find a pipeline or description of what was done (or is being done with the re-analysis) to obtain the residual counts, so I am not sure whether any other variables were also regressed out. Would it be possible for you to help me with this information? Best, Upamanyu
Hello, The files are named very confusingly. The variables listed in each file name have _not_ been regressed out in that file. So a file that just has "diagnosis" in the name should have both age and sex regressed out while leaving the effect of diagnosis in the results. I think these are probably the files you want. The only exception is the file with just "age", which has sex regressed out but not age or diagnosis even though diagnosis isn't in the name. If you want to do your own regression anyway, you can request un-censored ages from [Rush/RADC](https://www.radc.rush.edu/) or bin ages into groups, which is the approach we have started using at Sage. To be honest, I would recommend doing your own regression and not using these regressed matrices. There are some issues with them that we have recently found and we will be re-doing this analysis/regression and replacing the data very soon. I hope that helps, and let me know if you have any more questions. Jaclyn Beck
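For the age-binning option mentioned above, a minimal Python sketch of one way to turn censored ages into groups is shown here. It assumes censored values appear as strings such as "90+", and the bin edges are arbitrary choices for illustration, not the bins used at Sage.

```python
def bin_age(age, edges=(65, 70, 75, 80, 85, 90)):
    """Map an age (number or string) to a bin label.

    Censored values such as "90+" (assumed format) go to the open-ended top bin.
    The default edges are illustrative only.
    """
    text = str(age).strip()
    if text.endswith("+"):
        floor = float(text[:-1])
        if floor < edges[-1]:
            # A censoring threshold below the top edge cannot be placed in one bin.
            raise ValueError(f"censored age {text!r} is below the top bin edge {edges[-1]}")
        return f"{edges[-1]}+"
    value = float(text)
    if value < edges[0]:
        return f"<{edges[0]}"
    for low, high in zip(edges, edges[1:]):
        if low <= value < high:
            return f"{low}-{high - 1}"
    return f"{edges[-1]}+"


ages = ["72.4", "90+", 64, "85"]
print([bin_age(a) for a in ages])
```

The resulting labels can then be treated as a categorical covariate alongside sex in whatever regression you run, which avoids needing the exact ages for the censored samples.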
