Hello, I was exploring the data in the diverse cohorts project ([syn51732482](syn51732482)) and noticed that not all samples that have RNAseq data also have WGS data and vice-versa. Looking at the biospecimen metadata ([syn51757645](syn51757645)), this is what I see: ``` > divco_biospec_meta <- read.csv("AMP-AD_DiverseCohorts_biospecimen_metadata.csv") > rnaSeq_iids <- unique(divco_biospec_meta[divco_biospec_meta$assay == "rnaSeq", ]$individualID) > wgs_iids <- unique(divco_biospec_meta[divco_biospec_meta$assay == "wholeGenomeSeq", ]$individualID) > length(wgs_iids) [1] 761 > length(rnaSeq_iids) [1] 844 > length(intersect(wgs_iids, rnaSeq_iids)) [1] 606 > length(setdiff(wgs_iids, rnaSeq_iids)) [1] 155 > length(setdiff(rnaSeq_iids, wgs_iids)) [1] 238 ``` I was wondering if I am missing something or doing something wrong in checking the overlap of samples with both assays, or if this is the case. Thanks in advance. Best regards, Upamanyu

Created by Upamanyu Ghose ug96
Hello, I was wondering if someone is able to help me with the question in the previous message. Additionally, is there a planned analysis or release of jointly called WGS variants from the DivCo\_HS [syn68898203](syn68898203) and WGS\_Harmonization [syn22264775](syn22264775) studies? I want to merge the VCFs from the two, but the intersection between the common variants (MAF > 0.05) in the VCFs from "Diverse Cohorts + AMP 1.0" [syn68898203](syn68898203) and the WGS\_Harmonization [syn11707420](syn11707420) is quite small and I am not sure if it would be correct to merge the VCFs with such a small intersection. ## VCF Normalisation and MAF filter applied to both VCFs ```bash bcftools norm \ --no-version \ -m- \ -Ob \ --threads 4 \ -o "$SPLIT_VCF" \ "$INPUT_VCF" bcftools index "$SPLIT_VCF" echo echo "=========================" echo " Importing VCF into PLINK format..." echo "=========================" plink2 \ --bcf "$SPLIT_VCF" \ --set-all-var-ids "chr@_#_\$r_\$a_b38" \ --new-id-max-allele-len 1000 \ --make-pgen \ --out "${TEMP_DIR}/${BASE_NAME}" echo echo "=========================" echo " Applying MAF and HWE filters..." echo "=========================" plink2 --pfile "${TEMP_DIR}/${BASE_NAME}" \ --maf 0.05 \ --snps-only \ --rm-dup exclude-all \ --make-pgen \ --out "${TEMP_DIR}/${BASE_NAME}_filtered" plink2 --pfile "${TEMP_DIR}/${BASE_NAME}_filtered" \ --hwe 1e-6 midp \ --make-pgen \ --out "${TEMP_DIR}/${BASE_NAME}_hwe_filtered" plink2 --pfile "${TEMP_DIR}/${BASE_NAME}_hwe_filtered" \ --output-chr chrM \ --export bcf \ --out $OUTPUT_VCF ``` ## VCF intersection between DivCo\_HS and WGS\_Harmonization ```bash bcftools isec \ "$vcf1" "$vcf2" \ -p "$tmpdir/isec" VCF1_ONLY=$(bcftools view -H "$tmpdir/isec/0000.vcf" 2>/dev/null | wc -l) VCF2_ONLY=$(bcftools view -H "$tmpdir/isec/0001.vcf" 2>/dev/null | wc -l) SHARED=$(bcftools view -H "$tmpdir/isec/0002.vcf" 2>/dev/null | wc -l) TOTAL_VCF1=$(( VCF1_ONLY + SHARED )) TOTAL_VCF2=$(( VCF2_ONLY + SHARED )) OVERLAP_PCT=$(awk "BEGIN {printf \"%.2f\", 100*$SHARED/$TOTAL_VCF1}") echo "Overlap Results:" echo " VCF1 unique: $VCF1_ONLY" echo " VCF2 unique: $VCF2_ONLY" echo " Shared: $SHARED" echo " Overlap %: ${OVERLAP_PCT}%" ``` ### Sample output for chromosome 1: > Overlap Results: > VCF1 unique: 174922 > VCF2 unique: 41569 > Shared: 367566 > Overlap %: 67.76% Thank you. Best, Upamanyu

Overlap of samples with WGS and RNAseq in AMP-AD Diverse Cohorts page is loading…