Hi,
In syn3157325, the ROSMAP genotyping data (https://www.synapse.org/#!Synapse:syn3157325), the study description says that after sample-level and SNP-level quality control, "The resultant dataset includes 672,266 SNPs on 1709 individuals". So I assumed that the ROSMAP_arrayGenotype.bed/fam/bim files, which can be downloaded from that page, had already passed QC. But when I ran plink on them using "plink -bfile ROSMAP_arrayGenotype --mind 0.05 --out test --make-bed", the results were as follows:
--------- plink report -----------
750173 variants loaded from .bim file.
1708 people (527 males, 1181 females) loaded from .fam.
617 people removed due to missing genotype data (--mind).
750173 variants and 1091 people pass filters and QC.
1. So my first question is: are the bed/fam/bim files provided for download from before or after QC? Why do the numbers of SNPs and individuals in the description (672,266 and 1709) differ from what I see in the data (750,173 and 1708)?
2. I had hoped that all the RNA-seq samples would also be genotyped, but unfortunately I find that many RNA-seq samples have no genotypes.
For example, for samples whose gwas_id starts with "11AD" (e.g. 11AD39714, 11AD39715, 11AD39716...), the "ROSMAP_IDKey" file indicates that they should all have genotyping data (gwas_data = 1). But I cannot find them in ROSMAP_arrayGenotype.fam.
There are also other samples, such as projid = 38967303, 10260309, 44749170..., whose gwas_data = 0. Were these samples not genotyped?
3. My last question is how to define "AD" or "healthy" based on the clinical data. The description of syn3157325 says that 19% of the individuals have AD, so how do you define AD from the clinical data?
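The check behind question 2 could be sketched in Python roughly as follows. This is only an illustration: it assumes the ID key has been read into rows with the gwas_id and gwas_data columns named above, and that gwas_id is what appears in the second (IID) column of the .fam file, which may not hold for this dataset. The function name and the toy rows are made up.

```python
def genotyped_but_absent(id_key_rows, fam_lines):
    """Return gwas_ids flagged gwas_data == 1 that have no row in the .fam file."""
    # A PLINK .fam line is: FID IID father mother sex phenotype.
    # Assumption: gwas_id is stored in the IID column.
    fam_ids = set()
    for line in fam_lines:
        fields = line.split()
        if len(fields) >= 2:
            fam_ids.add(fields[1])
    return [
        row["gwas_id"]
        for row in id_key_rows
        if str(row["gwas_data"]).strip() == "1" and row["gwas_id"] not in fam_ids
    ]


# Toy data only; the TOY* IDs and the .fam row are invented for illustration.
id_key = [
    {"gwas_id": "11AD39714", "gwas_data": "1"},
    {"gwas_id": "TOY0001", "gwas_data": "1"},
    {"gwas_id": "TOY0002", "gwas_data": "0"},
]
fam = ["TOY0001 TOY0001 0 0 1 -9"]
print(genotyped_but_absent(id_key, fam))
```

With the real files, id_key_rows could come from csv.DictReader over the ID key and fam_lines from the open .fam file.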
Thanks for your time and any advice would be appreciated!
Best,
Tao
Created by Tao Wang twang
Thanks Solly,
This tool removes candidate mis-stranded palindromic SNPs (and also other SNPs) whose allele frequency differs by more than 0.2 from the reference panel. It can solve the problem to some extent. But I think it would be better if ROSMAP could provide the strand information for the genotyping data, so that we can rescue more SNPs. For example, suppose that on the reference panel SNP1 has alleles C (ref)/G (alt) with frequencies 0.9/0.1, and that a study reports SNP1 as G/C with frequencies 0.9/0.1. In that case the tool will discard this SNP, because the frequency difference is 0.8 > 0.2. But if we knew in advance that SNP1 was measured on the '-' strand, we could rescue it by flipping it first.
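The SNP1 example can be written out as a small Python sketch. It only mirrors the reasoning in this post, using the 0.2 threshold quoted above; it is not the tool's actual implementation, and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flip(freqs):
    """Re-express allele frequencies on the opposite strand."""
    return {COMPLEMENT[allele]: f for allele, f in freqs.items()}


def max_freq_diff(study, reference):
    """Largest per-allele frequency difference between study and reference."""
    return max(abs(study.get(allele, 0.0) - f) for allele, f in reference.items())


def keep_snp(study, reference, on_minus_strand=False, threshold=0.2):
    """Keep the SNP if its frequencies agree with the reference within threshold."""
    if on_minus_strand:
        # Known strand information lets us flip before comparing.
        study = flip(study)
    return max_freq_diff(study, reference) <= threshold


reference = {"C": 0.9, "G": 0.1}  # reference panel: C (ref) / G (alt)
study = {"G": 0.9, "C": 0.1}      # study reports G/C with frequencies 0.9/0.1

print(keep_snp(study, reference))                        # compared as reported
print(keep_snp(study, reference, on_minus_strand=True))  # flipped first
```

Compared as reported, the frequencies differ by about 0.8 and the SNP is dropped. With the strand known, the flipped frequencies match the reference and the SNP is kept.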
Best,
Tao
Tao-
I suggest the HRC-1000G-check-bim tool on that page. It automatically performs all the checks you need prior to imputation.
Solly
Thanks Solly! The chip strand data I mentioned was actually downloaded from that webpage. I also looked at the other tools, but I didn't find one that solves my problem: how to decide which strand the A/T and G/C SNPs are on. How did you treat those SNPs before imputation? I found that ~15% of the markers in the ROSMAP genotyping data are A/T or G/C SNPs.
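A count like the ~15% figure can be obtained from the .bim file with a few lines of Python. This is a minimal sketch, assuming the standard six-column PLINK .bim layout (chromosome, variant ID, genetic distance, position, allele 1, allele 2). The function name and the toy rows are illustrative only.

```python
PALINDROMIC = {frozenset("AT"), frozenset("CG")}


def palindromic_fraction(bim_lines):
    """Fraction of variants whose two alleles are A/T or C/G."""
    total = 0
    ambiguous = 0
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        total += 1
        alleles = frozenset((fields[4].upper(), fields[5].upper()))
        if alleles in PALINDROMIC:
            ambiguous += 1
    return ambiguous / total if total else 0.0


# Toy rows; IDs and positions are invented for illustration.
toy_bim = [
    "1 snpA 0 1000 G A",
    "1 snpB 0 2000 A T",
    "1 snpC 0 3000 G T",
    "1 snpD 0 4000 C A",
]
print(palindromic_fraction(toy_bim))
```

For the real data, the open file object for ROSMAP_arrayGenotype.bim can be passed in place of toy_bim.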
Tao
Tao-
You'll want to check out the tools here as suggested in the Michigan Imputation Server instructions: http://www.well.ox.ac.uk/~wrayner/tools/. There are several tools for preparing your data for imputation.
Solly
Hi Solly,
I'm working on the imputation in separate batches, but I'm having a problem with strand flipping. I hope you can help!
After QC, I did strand flipping based on the Affy 6.0 chip strand file (downloaded from http://www.well.ox.ac.uk/~wrayner/strand/index.html). I used Plink to flip all SNPs that the file lists on the minus strand. For example, if SNP1 is on the '-' strand, then SNP1 is flipped. After strand flipping, I ran the imputation on the Michigan Imputation Server, but it reported an error saying that too many strand flips were detected.
I looked into the source genotyping file (syn3157325) and found that the SNPs are not always on the same strand as listed in the chip strand file. For example, in the downloaded genotyping file, rs6576700 is G/A and rs10508202 is G/T. According to the Affy 6.0 chip strand file, rs6576700 is on the '-' strand with alleles G/A, and rs10508202 is on the '-' strand with alleles A/C. That means rs10508202 is already flipped in the downloaded data but rs6576700 is not.
I know that unambiguous SNPs like A/C or G/T can easily be flipped (or not) based on the reference genome. But I don't know how to flip SNPs like A/T and G/C. So how did you do the strand flipping before imputation? I'm new to this field and any advice would be appreciated!
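The comparison made for rs6576700 and rs10508202 can be expressed as a short Python sketch. It only compares allele pairs, so it can tell whether a non-palindromic SNP matches the listed alleles directly or after complementing, and it reports A/T and G/C SNPs as undecidable from alleles alone. The function name and return labels are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def compare_alleles(study, listed):
    """Classify a study allele pair against the pair listed in a strand file."""
    study_set = set(study)
    listed_set = set(listed)
    flipped_set = {COMPLEMENT[a] for a in study_set}
    if flipped_set == study_set:
        # A/T and C/G look identical on both strands.
        return "ambiguous"
    if study_set == listed_set:
        return "same"
    if flipped_set == listed_set:
        return "flipped"
    return "mismatch"


print(compare_alleles(("G", "A"), ("G", "A")))  # rs6576700: data vs strand file
print(compare_alleles(("G", "T"), ("A", "C")))  # rs10508202: data vs strand file
print(compare_alleles(("A", "T"), ("A", "T")))  # a palindromic SNP
```

The first call reports "same", the second "flipped", and the third "ambiguous". This matches the observation above that the two example SNPs are not treated consistently in the downloaded data.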
Best,
Tao
Thanks Solly! You are very helpful!
Tao-
As I mentioned in my previous reply, the samples were genotyped in 2 batches. The batches are easy to infer by simply assessing missingness per sample.
As for considering BRAAK or other measures, that would heavily depend on the analysis you're running and the question you want to ask, and is not something that I can answer for you.
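One way the batch inference from per-sample missingness might look in Python is sketched below. It is not necessarily how it was done for these data. It assumes a per-sample missing rate is already available (for example the F_MISS column of the .imiss file written by plink --missing) and splits the samples at the largest gap between sorted rates. That heuristic is an assumption, and a single outlier sample could mislead it. The sample IDs and rates are toy values.

```python
def split_by_largest_gap(missing_rate):
    """Split sample IDs into (low, high) missingness groups at the largest gap."""
    ordered = sorted(missing_rate.items(), key=lambda item: item[1])
    if len(ordered) < 2:
        return [sample for sample, _ in ordered], []
    # Gap between each pair of neighbouring samples in sorted order.
    gaps = [ordered[i + 1][1] - ordered[i][1] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    low = [sample for sample, _ in ordered[:cut]]
    high = [sample for sample, _ in ordered[cut:]]
    return low, high


# Toy missing rates, invented for illustration.
toy_rates = {"s1": 0.001, "s2": 0.003, "s3": 0.002, "s4": 0.101, "s5": 0.104}
low_batch, high_batch = split_by_largest_gap(toy_rates)
print(low_batch, high_batch)
```

Plotting a histogram of the missing rates first is a sensible check that there really are two well-separated groups before trusting the split.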
Solly
Hi Solly,
Thanks for your reply! It's really helpful! Can you tell me more about inferring the batches based on missingness? How many batches did you find?
About the AD definition: besides the cogdx score, which only reflects the patient's clinical status before death, do you think we should also consider an autopsy score like braaksc, which reflects the severity of neurofibrillary tangle pathology?
Thanks!
Tao
Hi Tao-
I'm not the data owner, but I've done some analyses on the data, so I can answer some of these questions.
1. Those data are post-QC. My understanding is that the genotyping and QC were done in two different batches, and one batch had substantially more SNPs removed than the other. That is why so many samples are removed based on missingness. In my analysis I inferred the batches on the basis of missingness, and performed additional QC and imputation separately in each batch.
3. Clinical data can be found in this file: https://www.synapse.org/#!Synapse:syn3191087. AD can be interpreted from the cogdx variable which is described here: https://www.synapse.org/#!Synapse:syn3191090.
I don't know the answer to (2).
Solly
syn3157325: ROSMAP genotyping data QC problem and samples missing problem.