Hello,
I was exploring the data in the diverse cohorts project ([syn51732482](syn51732482)) and noticed that not all samples that have RNAseq data also have WGS data and vice-versa. Looking at the biospecimen metadata ([syn51757645](syn51757645)), this is what I see:
```
> divco_biospec_meta <- read.csv("AMP-AD_DiverseCohorts_biospecimen_metadata.csv")
> rnaSeq_iids <- unique(divco_biospec_meta[divco_biospec_meta$assay == "rnaSeq", ]$individualID)
> wgs_iids <- unique(divco_biospec_meta[divco_biospec_meta$assay == "wholeGenomeSeq", ]$individualID)
> length(wgs_iids)
[1] 761
> length(rnaSeq_iids)
[1] 844
> length(intersect(wgs_iids, rnaSeq_iids))
[1] 606
> length(setdiff(wgs_iids, rnaSeq_iids))
[1] 155
> length(setdiff(rnaSeq_iids, wgs_iids))
[1] 238
```
I was wondering if I am missing something or doing something wrong in checking the overlap of samples with both assays, or if this is the case. Thanks in advance.
Best regards,
Upamanyu
Created by Upamanyu Ghose ug96 Hello,
I was wondering if someone is able to help me with the question in the previous message.
Additionally, is there a planned analysis or release of jointly called WGS variants from the DivCo\_HS [syn68898203](syn68898203) and WGS\_Harmonization [syn22264775](syn22264775) studies? I want to merge the VCFs from the two, but the intersection between the common variants (MAF > 0.05) in the VCFs from "Diverse Cohorts + AMP 1.0" [syn68898203](syn68898203) and the WGS\_Harmonization [syn11707420](syn11707420) is quite small and I am not sure if it would be correct to merge the VCFs with such a small intersection.
## VCF Normalisation and MAF filter applied to both VCFs
```bash
bcftools norm \
--no-version \
-m- \
-Ob \
--threads 4 \
-o "$SPLIT_VCF" \
"$INPUT_VCF"
bcftools index "$SPLIT_VCF"
echo
echo "========================="
echo " Importing VCF into PLINK format..."
echo "========================="
plink2 \
--bcf "$SPLIT_VCF" \
--set-all-var-ids "chr@_#_\$r_\$a_b38" \
--new-id-max-allele-len 1000 \
--make-pgen \
--out "${TEMP_DIR}/${BASE_NAME}"
echo
echo "========================="
echo " Applying MAF and HWE filters..."
echo "========================="
plink2 --pfile "${TEMP_DIR}/${BASE_NAME}" \
--maf 0.05 \
--snps-only \
--rm-dup exclude-all \
--make-pgen \
--out "${TEMP_DIR}/${BASE_NAME}_filtered"
plink2 --pfile "${TEMP_DIR}/${BASE_NAME}_filtered" \
--hwe 1e-6 midp \
--make-pgen \
--out "${TEMP_DIR}/${BASE_NAME}_hwe_filtered"
plink2 --pfile "${TEMP_DIR}/${BASE_NAME}_hwe_filtered" \
--output-chr chrM \
--export bcf \
--out $OUTPUT_VCF
```
## VCF intersection between DivCo\_HS and WGS\_Harmonization
```bash
bcftools isec \
"$vcf1" "$vcf2" \
-p "$tmpdir/isec"
VCF1_ONLY=$(bcftools view -H "$tmpdir/isec/0000.vcf" 2>/dev/null | wc -l)
VCF2_ONLY=$(bcftools view -H "$tmpdir/isec/0001.vcf" 2>/dev/null | wc -l)
SHARED=$(bcftools view -H "$tmpdir/isec/0002.vcf" 2>/dev/null | wc -l)
TOTAL_VCF1=$(( VCF1_ONLY + SHARED ))
TOTAL_VCF2=$(( VCF2_ONLY + SHARED ))
OVERLAP_PCT=$(awk "BEGIN {printf \"%.2f\", 100*$SHARED/$TOTAL_VCF1}")
echo "Overlap Results:"
echo " VCF1 unique: $VCF1_ONLY"
echo " VCF2 unique: $VCF2_ONLY"
echo " Shared: $SHARED"
echo " Overlap %: ${OVERLAP_PCT}%"
```
### Sample output for chromosome 1:
> Overlap Results:
> VCF1 unique: 174922
> VCF2 unique: 41569
> Shared: 367566
> Overlap %: 67.76%
Thank you.
Best,
Upamanyu
Drop files to upload
Overlap of samples with WGS and RNAseq in AMP-AD Diverse Cohorts page is loading…