Hello,
I was exploring the data in the diverse cohorts project ([syn51732482](syn51732482)) and noticed that not all samples that have RNAseq data also have WGS data and vice-versa. Looking at the biospecimen metadata ([syn51757645](syn51757645)), this is what I see:
```
> divco_biospec_meta <- read.csv("AMP-AD_DiverseCohorts_biospecimen_metadata.csv")
> rnaSeq_iids <- unique(divco_biospec_meta[divco_biospec_meta$assay == "rnaSeq", ]$individualID)
> wgs_iids <- unique(divco_biospec_meta[divco_biospec_meta$assay == "wholeGenomeSeq", ]$individualID)
> length(wgs_iids)
[1] 761
> length(rnaSeq_iids)
[1] 844
> length(intersect(wgs_iids, rnaSeq_iids))
[1] 606
> length(setdiff(wgs_iids, rnaSeq_iids))
[1] 155
> length(setdiff(rnaSeq_iids, wgs_iids))
[1] 238
```
I was wondering if I am missing something or doing something wrong in checking the overlap of samples with both assays, or if this is the case. Thanks in advance.
Best regards,
Upamanyu
Created by Upamanyu Ghose ug96 hello @ug96,
Apologies for the very late reply. You are correct that there is not a perfect overlap of WGS and RNASeq donors in this study. Your numbers align with the published number of donors and content in the metadata, but I will double check with the raw fastq/bam files and let you know if I find any discrepancies.
As for the joint calling effort, we unfortunately do not currently have the compute budget to reprocess all of the AMP 1.0 WGS data with an updated reference genome, so I am not surprised by the poor intersection. I am considering performing a genome liftover as a short term solution to make the old WGS harmonization data more useful, but don't yet have a concrete timeline for releasing this.
Best,
Will Hello,
I was wondering if someone is able to help me with the question in the previous message.
Additionally, is there a planned analysis or release of jointly called WGS variants from the DivCo\_HS [syn68898203](syn68898203) and WGS\_Harmonization [syn22264775](syn22264775) studies? I want to merge the VCFs from the two, but the intersection between the common variants (MAF > 0.05) in the VCFs from "Diverse Cohorts + AMP 1.0" [syn68898203](syn68898203) and the WGS\_Harmonization [syn11707420](syn11707420) is quite small and I am not sure if it would be correct to merge the VCFs with such a small intersection.
## VCF Normalisation and MAF filter applied to both VCFs
```bash
bcftools norm \
--no-version \
-m- \
-Ob \
--threads 4 \
-o "$SPLIT_VCF" \
"$INPUT_VCF"
bcftools index "$SPLIT_VCF"
echo
echo "========================="
echo " Importing VCF into PLINK format..."
echo "========================="
plink2 \
--bcf "$SPLIT_VCF" \
--set-all-var-ids "chr@_#_\$r_\$a_b38" \
--new-id-max-allele-len 1000 \
--make-pgen \
--out "${TEMP_DIR}/${BASE_NAME}"
echo
echo "========================="
echo " Applying MAF and HWE filters..."
echo "========================="
plink2 --pfile "${TEMP_DIR}/${BASE_NAME}" \
--maf 0.05 \
--snps-only \
--rm-dup exclude-all \
--make-pgen \
--out "${TEMP_DIR}/${BASE_NAME}_filtered"
plink2 --pfile "${TEMP_DIR}/${BASE_NAME}_filtered" \
--hwe 1e-6 midp \
--make-pgen \
--out "${TEMP_DIR}/${BASE_NAME}_hwe_filtered"
plink2 --pfile "${TEMP_DIR}/${BASE_NAME}_hwe_filtered" \
--output-chr chrM \
--export bcf \
--out $OUTPUT_VCF
```
## VCF intersection between DivCo\_HS and WGS\_Harmonization
```bash
bcftools isec \
"$vcf1" "$vcf2" \
-p "$tmpdir/isec"
VCF1_ONLY=$(bcftools view -H "$tmpdir/isec/0000.vcf" 2>/dev/null | wc -l)
VCF2_ONLY=$(bcftools view -H "$tmpdir/isec/0001.vcf" 2>/dev/null | wc -l)
SHARED=$(bcftools view -H "$tmpdir/isec/0002.vcf" 2>/dev/null | wc -l)
TOTAL_VCF1=$(( VCF1_ONLY + SHARED ))
TOTAL_VCF2=$(( VCF2_ONLY + SHARED ))
OVERLAP_PCT=$(awk "BEGIN {printf \"%.2f\", 100*$SHARED/$TOTAL_VCF1}")
echo "Overlap Results:"
echo " VCF1 unique: $VCF1_ONLY"
echo " VCF2 unique: $VCF2_ONLY"
echo " Shared: $SHARED"
echo " Overlap %: ${OVERLAP_PCT}%"
```
### Sample output for chromosome 1:
> Overlap Results:
> VCF1 unique: 174922
> VCF2 unique: 41569
> Shared: 367566
> Overlap %: 67.76%
Thank you.
Best,
Upamanyu
Drop files to upload
Overlap of samples with WGS and RNAseq in AMP-AD Diverse Cohorts page is loading…