Hello,

I am confused about the RNA-seq data from Release3 (described in the paper from Hoffman 2019). I see the "expected_counts" and "length" under "Files ControlledAccess Data RNAseq Release3 QuantitatedExpression", but I do not understand which normalization steps have already been performed. In particular:

- Are these already corrected for principal components (PCs), or is it necessary to apply PCA to these expected counts to correct for batch effects and so on?
- Are the log2 counts per million, FPKM, or TPM values available somewhere?
- Have these steps described in the paper already been performed? "Quality control metrics were reported with RNA-SeqQC (v1.1.7). All analysis used log2 counts per million (CPM) following TMM normalization implemented in edgeR (v3.22.5). Correction for GC content bias was performed with cqn (v1.26.0). Genes with over 1 CPM in at least 50% of the experiments were retained."

Many thanks for any clarifications!

All Best,
Claudia

Created by Claudia Giambartolomei clagiamba
Hi @kelsey, perfect, it is all clear now! Many thanks for clarifying!
Thank you for reaching out @clagiamba. Release 4 does not contain all of the samples in Release 3, since Hoffman et al. 2021 analyzed neurotypical individuals and individuals diagnosed with schizophrenia (plus other inclusion criteria, such as age, that further reduced the sample size). The log2 counts per million, filtered, and conditional quantile normalized (CQN) counts, as referred to in [Hoffman et al. 2019](https://www.nature.com/articles/s41597-019-0183-6.pdf?origin=ppub), were generated but not released publicly. I will release these files and tag you in this discussion when they are available. I will also include a Markdown file documenting our criteria for outlier samples identified by PCA. It is important to note that TMM normalization was dropped from our pipeline, so those counts will not be part of this release.
Also, I am just looking for the counts to repeat the normalization. Under Release 4 QuantitatedExpression, NIMH.HBCC.featureCount.tsv.gz + MSSM.Penn.Pitt_DLPFC.featureCount.tsv.gz contain counts for 957 individuals, but the paper talks about 981 individuals with both RNAseq and genetics data. Were some individuals already removed from these data? Thanks again!
I don't see that all samples from Release 3 used in Hoffman et al 2019 are in Release 4: it looks like the Release 4 QuantitatedExpression files (residuals and voom) include only 777 with SCZ. Am I missing something? Maybe the normalized file described in Hoffman et al 2019 is located somewhere else? Thank you very much for all the help.
Release 4 should have all the samples used in Hoffman et al 2019, so if you're looking for normalized expression, you can just use this. @kelsey can you confirm?
Ok, thank you for the quick reply. Will the normalized counts described in the paper "CommonMind Consortium provides transcriptomic and epigenomic data for Schizophrenia and Bipolar Disorder" be made available, or even just the outliers identified from PCA in "RNA-seq quality control"? For example, normalized data are available in both Release 1 and Release 4. In Release 1, there were files under NormalizedExpression called "*adjustedLogCPM". In Release 4, QuantitatedExpression has Voom results for each cohort and Residuals correcting for covariates and add back (Dx_Sex Dx Sex). Thanks
These are expected counts as quantified by RSEM. They have not been corrected or normalized.
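Since the expected counts are unnormalized, here is a minimal sketch of the two simplest steps quoted from the paper: converting counts to CPM and keeping genes with over 1 CPM in at least 50% of samples. It is an illustration only, not the consortium's pipeline: it uses plain library-size scaling in NumPy rather than edgeR's TMM normalization, it skips the cqn GC-content correction, and the function names, the pseudocount of 0.5, and the toy matrix are all assumptions made for the example.

```python
import numpy as np

def cpm(counts):
    """Counts per million using raw library sizes (no TMM scaling factors)."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0, keepdims=True)  # total counts per sample
    return counts / lib_sizes * 1e6

def expressed_genes(counts, min_cpm=1.0, min_fraction=0.5):
    """Boolean mask of genes with CPM > min_cpm in at least min_fraction of samples."""
    above = cpm(counts) > min_cpm
    return above.mean(axis=1) >= min_fraction

def log2_cpm(counts, pseudocount=0.5):
    """log2(CPM + pseudocount); the pseudocount is an assumed value to avoid log2(0)."""
    return np.log2(cpm(counts) + pseudocount)

# Toy genes x samples matrix (hypothetical values); each column sums to 1,000,000.
toy_counts = np.array([
    [10, 20, 30, 40],
    [0, 0, 0, 5],
    [999990, 999980, 999970, 999955],
])

keep = expressed_genes(toy_counts)
# Library sizes are computed before filtering, then low-expression genes are dropped.
filtered_log2_cpm = log2_cpm(toy_counts)[keep]
```

With real data, the genes x samples matrix would come from the "expected_counts" files, and the result would still need TMM or another normalization plus GC-content correction to approach what the paper describes.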

syn18103849 RNA-seq log2 counts per million for CMC & CMC_HBCC studies