Hello! Looking for feedback, independent validation, and potentially to warn others.
TL;DR: Analysis of sex-specific gene expression indicates that male and female cells are mixed within each sample ID (at least in our hands).
When pseudobulking gene expression by sample ID, the female-specific transcript XIST and the Y-chromosome gene EIF1AY co-occur and, in the raw counts, are positively correlated with each other. If each sample ID corresponded to a single donor, this would be biologically impossible, and it can't be attributed to batch effects. Two of us ran the analysis separately and got the same result, which cross-validates the concern.
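For anyone trying to replicate this, here is a minimal sketch of the kind of check described above, not the exact code that produced the attached plots. It assumes a per-cell table of raw counts with hypothetical column names `sample_id`, `XIST`, and `EIF1AY`; the `min_counts` threshold is also an assumption.

```python
import pandas as pd


def pseudobulk_sex_markers(cells, sample_col="sample_id",
                           female_gene="XIST", male_gene="EIF1AY",
                           min_counts=1):
    """Sum raw marker counts per sample ID and flag samples with both markers."""
    bulk = cells.groupby(sample_col)[[female_gene, male_gene]].sum()
    # A sample from a single donor should be dominated by one of the two markers.
    bulk["both_detected"] = (
        (bulk[female_gene] >= min_counts) & (bulk[male_gene] >= min_counts)
    )
    return bulk


# Toy per-cell raw counts (made-up numbers, for illustration only).
cells = pd.DataFrame({
    "sample_id": ["A", "A", "A", "B", "B", "C", "C"],
    "XIST":      [3, 0, 2, 0, 0, 4, 1],
    "EIF1AY":    [0, 2, 0, 1, 3, 0, 0],
})

bulk = pseudobulk_sex_markers(cells)
# Pearson correlation of the two markers across pseudobulked samples.
marker_corr = bulk["XIST"].corr(bulk["EIF1AY"])
print(bulk)
print(marker_corr)
```

In real data, ambient RNA can contribute a few stray counts of either marker, so a higher `min_counts` may be more appropriate than the default used here.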
The methods section states that demuxlet was used to demultiplex samples within each batch. It's feasible that demuxlet failed and sample IDs were scrambled, leading to cells from biologically male and female samples being assigned to the same sample ID. This hypothesis is partially supported by the fact that, within each batch, the number of cells expressing XIST and EIF1AY appears relatively homogeneous across sample IDs. In other words, each batch had a different number of male and female subjects, so if demultiplexing failed, the resulting "sample IDs" would all contain a similar ratio of male to female cells, with that ratio depending on the batch. This appears to explain the data well, but it is just conjecture. I've attached plots to show what we are seeing. Demuxlet would need to be re-run and evaluated with the donors' genetic backgrounds, which we don't have access to, so we aren't able to re-analyze this dataset with confidence at the moment.
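The batch-level pattern can be checked with a sketch along these lines. It assumes the same kind of per-cell table with hypothetical `batch` and `sample_id` columns, computes the fraction of cells expressing each marker per sample ID, and then reports how much those fractions vary within each batch. A spread near zero in every batch is what scrambled sample IDs would be expected to produce.

```python
import pandas as pd


def marker_fractions(cells, batch_col="batch", sample_col="sample_id",
                     genes=("XIST", "EIF1AY")):
    """Fraction of cells with at least one count of each marker, per sample ID."""
    expressed = cells[list(genes)].gt(0).astype(float)
    return expressed.groupby([cells[batch_col], cells[sample_col]]).mean()


def within_batch_spread(per_sample, batch_col="batch"):
    """Max minus min of the per-sample fractions within each batch."""
    grouped = per_sample.groupby(level=batch_col)
    return grouped.max() - grouped.min()


# Toy data (made-up numbers) mimicking the scrambled scenario.
cells = pd.DataFrame({
    "batch":     ["b1"] * 8 + ["b2"] * 8,
    "sample_id": ["S1"] * 4 + ["S2"] * 4 + ["S3"] * 4 + ["S4"] * 4,
    "XIST":      [1, 0, 2, 0,  0, 1, 0, 1,  1, 1, 1, 0,  2, 0, 1, 1],
    "EIF1AY":    [0, 1, 0, 3,  2, 0, 1, 0,  0, 0, 0, 1,  0, 1, 0, 0],
})

per_sample = marker_fractions(cells)
spread = within_batch_spread(per_sample)
print(per_sample)
print(spread)
```

With correctly demultiplexed donors, the per-sample fractions within a mixed-sex batch would instead be expected to sit near the extremes (mostly XIST or mostly EIF1AY), giving a large spread.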
We've gone through the typical channels to discuss this, and I think they are reviewing it. I just wanted to post this as a warning for the general community (and also to see if others have seen this as well).
The dataset is obviously still quite useful for training models, cell typing, etc., but I would caution people against deriving biological insights from the data, given that we do not know which cells were derived from which donors.
If anyone has alternative explanations, a workaround, or is unable to replicate our results, it would be great to hear from them!

Edit:
The ability to embed images of graphs doesn't seem to be working on this forum. I've attached a link to Imgur so that people can view the plots.
https://imgur.com/uADtkVY
https://imgur.com/v2WEUBA
https://imgur.com/fN3Fy5U
Created by Mark-Phillip Pebworth (MPPebworth)
Ah, I see. Thank you for your work!
BAM files are typically an output from CellRanger and they also would not be impacted by the sample_id issue. This is a high quality dataset with a (hopefully) very fixable metadata issue. I'm looking forward to when the corrected version is available!
Hi @MPPebworth,
Thank you for your follow-up and for specifying which dataset you were referring to. We have temporarily removed access to the AMP RA/SLE PBMC CITEseq dataset to prevent erroneous research results. We are currently working closely with the contributors to make a corrected version of the data available.
Unfortunately there are no BAM files available that can be provided. BAM files were not included in the initial contribution nor are they available from the contributors at this time.
Sorry for the late reply here - it was the AMP RA/SLE PBMC CITEseq dataset (63912719), but it looks like the entire dataset was scrubbed, including the BAM files. Based on Bishoy's response, the files available under the genome section are exactly what we need to re-run demultiplexing on the BAM files and hopefully fix the cell id vs sample id issue.
Would it be possible to put at least the BAM files back up so the community has a chance to try to fix it and confirm? When reviewing the ARK portal, there are no files present at all for that study. https://www.synapse.org/Synapse:syn64495007
Hi @MPPebworth, would you be able to reply with the name or synapse ID of the dataset in question? This will help ensure that this information is clearly associated with the dataset in question. Thanks!
Dear Mark-Phillip Pebworth,
Thanks for raising this and sharing your plots. Which dataset in particular are you referring to? We are aware of putative issues with at least one [dataset](https://arkportal.synapse.org/Explore/Datasets/DetailsPage?id=63912719) currently and are working with the contributors to figure out the pertinent next steps to address them. The [ARK release page](https://news.arkportal.org/data-release/) also documents any issues that we discover and the actions taken to remedy them; you can refer to it for additional information about issues with datasets. With regard to demuxing using subject genotype data, the subject genotype information for AMP RA/SLE is actually available [here](https://arkportal.synapse.org/Explore/Datasets/DetailsPage?id=63424468).
Once you confirm which dataset has the issues you identified, we can look further into that.
Feel free to follow up with additional questions or clarifications through our helpdesk as well. [https://sagebionetworks.jira.com/servicedesk/customer/portal/11](https://sagebionetworks.jira.com/servicedesk/customer/portal/11)
Regards,
Bishoy Kamel
Associate Director · Computational Immunology