... from biospecimen (syn18345334) and individual (syn18345335) metadata files
Here are the specimens that I could not find:
58, 44, 42, 57, 46, 35, 54, 56, 30271, 30269, 30270, 251, 116
I also noticed that the processed data for this study, specifically htseqcounts_APTR.txt (syn22107627), contains 234 specimens, but only 144 BAMs are provided on Synapse. Do some BAMs contain reads from more than one specimen?
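In case it helps to reproduce the comparison, here is a minimal sketch of how the two specimen sets could be checked against each other. The vectors are toy stand-ins, and the assumption that each BAM file name is just the specimen ID plus `.bam` is mine, not something confirmed by the study files:

```r
# Hypothetical inputs: specimen columns of the count matrix and BAM file names.
count_specimens <- c("3rh", "8rh", "12rh", "25rh")
bam_files <- c("3rh.bam", "12rh.bam")

# Assumption: the BAM name is the specimen ID followed by '.bam'.
bam_specimens <- sub("\\.bam$", "", bam_files)

# Specimens that have counts but no BAM, and BAMs that have no counts.
counts_without_bam <- setdiff(count_specimens, bam_specimens)
bam_without_counts <- setdiff(bam_specimens, count_specimens)

print(counts_without_bam)
print(bam_without_counts)
```

With the real files, `count_specimens` would come from the header of the count matrix and `bam_files` from a listing of the downloaded BAMs.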
Thank you again for all your help.
Created by Rached Alkallas ralkallas
Hi all. I encountered similar issues. Is there any update on this issue? Thanks.
Hi @ryaxley,
Have you heard back from the JAX team regarding this?
Thank you!
Thank you,
I also noticed that in addition to the samples I mentioned above, many mice seem to be missing genotypes:
```
> si2[is.na(genotype)][order(individualID), ]
    individualID  specimenID genotype genotypeBackground                        study
 1:            3         3rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
 2:            8         8rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
 3:           12        12rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
 4:           25        25rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
 5:           27        27rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
 6:           30        30rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
 7:          123       123rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
 8:          181       181rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
 9:          191       191rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
10:          193       193rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
11:          221       221rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
12:          225       225rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
13:          239       239rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
14:          242       242rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
15:          252       252rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
16:          272       272rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
17:          299       299rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
18:          344       344rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
19:          347       347rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
20:          364       364rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
21:          367       367rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
22:          393       393rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
23:          394       394rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
24:          399       399rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
25:          400       400rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
26:        20246     20246rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
27:        20248     20248rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
28:        27168     27168rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
29:        27354     27354rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
30:        27355     27355rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
31:        27356     27356rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
32:        27357     27357rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
33:        28129     28129rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
34:    288810708 288810708rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
35:    289457705 289457705rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
36:    289461928 289461928rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
37:    289470196 289470196rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
38:    289478142 289478142rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
39:    289482201 289482201rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
40:    289494346 289494346rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
41:    289535121 289535121rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
42:    289576914 289576914rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
43:    289666353 289666353rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
44:    289674340 289674340rh     <NA>            C57BL6J Jax.IU.Pitt_APOE4.Trem2.R47H
```
If it makes it easier for you to narrow down these samples in your own records, here is my R code, which takes as input the directory containing the fastq and metadata files for this study:
```
library(data.table)
library(magrittr)
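# Collect every FASTQ path under the models directory, sorted for reproducibility
experiments <- system('find /scratch/user/models -name "*fastq*"', intern = TRUE) %>% sort
# Drop single-cell runs; !grepl() is safe when nothing matches, whereas -grep() would then drop everything
experiments <- experiments[ !grepl('single cell RNA seq', experiments) ]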
experiments <- split(
gsub('(.+/)|(_R(1|2)_001.fastq.gz|_001_R(1|2).fastq.gz|_R(1|2).fastq.gz)', '', experiments),
sapply(strsplit(experiments, '/'), '[', 5)
) %>% lapply(., unique)
# Read the RNAseq, biospecimen and individual metadata CSVs into a list named by file path
Jax.IU.Pitt_APOE4.Trem2.R47H <- lapply(system('find /scratch/user/models/Jax.IU.Pitt_APOE4.Trem2.R47H/Metadata -name "*.csv" | egrep "RNA|biospecimen|individual"', intern = TRUE) %>% setNames(nm = .), fread)
names(Jax.IU.Pitt_APOE4.Trem2.R47H) <- names(Jax.IU.Pitt_APOE4.Trem2.R47H) %>% gsub('^.+/', '', .) %>% gsub('\\.csv$', '', .) %>% gsub('Jax.IU.Pitt_APOE4.Trem2.R47H_|_metadata|asssay_', '', .)
# > Jax.IU.Pitt_APOE4.Trem2.R47H$biospecimen[ , table(specimenID %>% gsub('[0-9]+', '', .), tissue), ]
# tissue
# right cerebral hemisphere serum
# 0 428
# rh 406 0
# Does every RNAseq specimenID match a biospecimen specimenID once 'rh' is appended?
all(paste0(Jax.IU.Pitt_APOE4.Trem2.R47H$RNAseq$specimenID, 'rh') %in% Jax.IU.Pitt_APOE4.Trem2.R47H$biospecimen$specimenID)
# How many of them match?
sum(paste0(Jax.IU.Pitt_APOE4.Trem2.R47H$RNAseq$specimenID, 'rh') %in% Jax.IU.Pitt_APOE4.Trem2.R47H$biospecimen$specimenID)
# RNAseq specimenIDs with no biospecimen match, first without and then with the 'rh' suffix
Jax.IU.Pitt_APOE4.Trem2.R47H$RNAseq$specimenID[!paste0(Jax.IU.Pitt_APOE4.Trem2.R47H$RNAseq$specimenID, '') %in% Jax.IU.Pitt_APOE4.Trem2.R47H$biospecimen$specimenID]
Jax.IU.Pitt_APOE4.Trem2.R47H$RNAseq$specimenID[!paste0(Jax.IU.Pitt_APOE4.Trem2.R47H$RNAseq$specimenID, 'rh') %in% Jax.IU.Pitt_APOE4.Trem2.R47H$biospecimen$specimenID]
# [1] 58 44 42 57 46 35 54 56 30271 30269 30270 251 116
# Full outer join of individual and biospecimen metadata, so unmatched rows on either side are kept
JIP_APOE4.Trem2.R47H.sampInfo <- merge(Jax.IU.Pitt_APOE4.Trem2.R47H$individual, Jax.IU.Pitt_APOE4.Trem2.R47H$biospecimen, by = 'individualID', all = TRUE)
# Keep the original RNAseq specimenID, then append 'rh' to match the biospecimen naming
Jax.IU.Pitt_APOE4.Trem2.R47H$RNAseq[ , specimenID.ori := specimenID, ]
Jax.IU.Pitt_APOE4.Trem2.R47H$RNAseq[ , specimenID := paste0(specimenID, 'rh'), ]
JIP_APOE4.Trem2.R47H.sampInfo <- merge(JIP_APOE4.Trem2.R47H.sampInfo, Jax.IU.Pitt_APOE4.Trem2.R47H$RNAseq, by = 'specimenID', all = TRUE)
# intersect with rna sample names
experiments$Jax.IU.Pitt_APOE4.Trem2.R47H
extracted.id <- paste0(gsub('_.+$', '', experiments$Jax.IU.Pitt_APOE4.Trem2.R47H), 'rh')
extracted.id %>% duplicated %>% sum
JIP_APOE4.Trem2.R47H.sampInfo[ , all(extracted.id %in% specimenID), ]
experiments$Jax.IU.Pitt_APOE4.Trem2.R47H <- setNames(extracted.id, experiments$Jax.IU.Pitt_APOE4.Trem2.R47H)
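# Intended to copy each individual's non-empty genotype onto all of that individual's specimen rows
JIP_APOE4.Trem2.R47H.sampInfo[!is.na(specimenID), genotype.filled := unique(genotype[genotype != '']) %>% na.omit, by = .(individualID), ]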
si2 <- JIP_APOE4.Trem2.R47H.sampInfo[
specimenID %in% experiments$Jax.IU.Pitt_APOE4.Trem2.R47H,
.(individualID, specimenID, genotype = genotype.filled, genotypeBackground, study = 'Jax.IU.Pitt_APOE4.Trem2.R47H')
] %>% unique
si2[is.na(genotype), ][order(individualID), ]
```
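One caveat about the genotype fill step above: it cannot tell apart an individual with no recorded genotype from one with conflicting genotype records. A small sketch of how that could be checked is below. The table and its genotype values are toy stand-ins that I made up for illustration, not rows from the study metadata:

```r
library(data.table)

# Toy stand-in for the merged sample table (hypothetical values).
samp <- data.table(
  individualID = c(1L, 1L, 2L, 2L, 3L, 3L),
  specimenID   = c("1", "1rh", "2", "2rh", "3", "3rh"),
  genotype     = c("APOE4", "", NA, "", "APOE4", "Trem2")
)

# Count the distinct non-empty genotypes recorded for each individual.
geno_summary <- samp[, .(
  n_genotypes = uniqueN(genotype[!is.na(genotype) & genotype != ""])
), by = individualID]

# 0 means the genotype was never recorded; more than 1 means conflicting records.
geno_summary[, status := fifelse(n_genotypes == 0L, "missing",
                         fifelse(n_genotypes == 1L, "ok", "conflicting"))]
print(geno_summary)
```

Running the same summary on the real merged table should show whether the 44 mice above have no genotype anywhere in the metadata, or whether some of them have more than one.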
Thank you for the clear explanation of the issue. I will contact the JAX team and inquire about the missing metadata and BAM files.
Rich
Specimens in Jax.IU.Pitt_APOE4.Trem2.R47H mouse study RNAseq metadata file (syn18345333) are missing...