Hello! Looking for feedback, independent validation, and potentially to warn others.
TLDR: Analysis of sex-specific gene expression indicates that male and female cells are mixed within each sample ID (at least in our hands).
When pseudobulking gene expression by sample ID, expression of the female-specific gene XIST and the Y-chromosome gene EIF1AY co-occur, and in the raw counts they are directly correlated with each other. If each sample ID corresponded to a single donor, this would be biologically impossible, and it can't be attributed to batch effects. Two people on our side ran the analysis separately and got the same results.
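For anyone who wants to try reproducing the check, here is a minimal sketch of the idea in plain Python. The per-cell counts and the sample IDs `S1` to `S3` are invented toy values, not data from the dataset, and the helper names `pseudobulk` and `pearson` are just for illustration. It sums raw XIST and EIF1AY counts per sample ID, flags sample IDs where both markers are detected, and computes the correlation across samples.

```python
from collections import defaultdict
from math import sqrt

# Toy per-cell raw counts (invented for illustration only):
# (sample_id, XIST count, EIF1AY count)
cells = [
    ("S1", 3, 0), ("S1", 0, 2), ("S1", 4, 0), ("S1", 0, 1),
    ("S2", 5, 0), ("S2", 0, 3), ("S2", 6, 0), ("S2", 0, 4),
    ("S3", 1, 0), ("S3", 0, 1),
]


def pseudobulk(cells):
    """Sum raw XIST and EIF1AY counts over all cells sharing a sample ID."""
    totals = defaultdict(lambda: [0, 0])
    for sample_id, xist, eif1ay in cells:
        totals[sample_id][0] += xist
        totals[sample_id][1] += eif1ay
    return {sample_id: tuple(t) for sample_id, t in totals.items()}


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


bulk = pseudobulk(cells)

# Sample IDs where both a female-specific and a Y-linked marker are detected.
mixed = [s for s, (xist, eif1ay) in bulk.items() if xist > 0 and eif1ay > 0]

# Correlation of the two markers across pseudobulked sample IDs.
r = pearson([x for x, _ in bulk.values()], [y for _, y in bulk.values()])

print(bulk)
print("sample IDs with both markers:", mixed)
print("Pearson r:", round(r, 3))
```

With correctly demultiplexed single-donor samples, one would expect the `mixed` list to be empty or nearly so (allowing for ambient RNA and doublets), and no positive correlation between the two markers.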
The methods section states that demuxlet was used to demultiplex samples within each batch. It's feasible that demuxlet failed and sample IDs were scrambled, leading to cells from biologically male and female samples being assigned to the same sample ID. This hypothesis is partially supported by the fact that each batch appears to have a relatively homogeneous number of cells expressing XIST and EIF1AY. In other words, each batch had a different number of male and female subjects, so if demultiplexing failed, the resulting "sample IDs" within a batch would contain a relatively similar ratio of male and female cells, with that ratio varying by batch. This appears to explain the data well, but it is just conjecture. I've attached plots to show what we are seeing. Demuxlet would need to be re-run and evaluated with the donors' genetic backgrounds, which we don't have access to, so we aren't able to re-analyze this dataset with confidence at the moment.
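The batch-level pattern described above can be checked with something like the following sketch. Again, the batches `B1`/`B2`, the sample IDs, and the counts are made-up toy values, and the function names are hypothetical. For each sample ID it computes the fraction of marker-expressing cells that express XIST, then reports how much that fraction varies between sample IDs of the same batch.

```python
from collections import defaultdict

# Toy per-cell raw counts (invented for illustration only):
# (batch, sample_id, XIST count, EIF1AY count)
cells = [
    ("B1", "S1", 2, 0), ("B1", "S1", 0, 1),
    ("B1", "S2", 1, 0), ("B1", "S2", 0, 3),
    ("B2", "S3", 1, 0), ("B2", "S3", 2, 0), ("B2", "S3", 3, 0), ("B2", "S3", 0, 1),
    ("B2", "S4", 1, 0), ("B2", "S4", 1, 0), ("B2", "S4", 2, 0), ("B2", "S4", 0, 2),
]


def xist_fraction_by_sample(cells):
    """Per (batch, sample ID): XIST+ cells / cells expressing XIST or EIF1AY."""
    xist_pos = defaultdict(int)
    informative = defaultdict(int)
    for batch, sample_id, xist, eif1ay in cells:
        if xist == 0 and eif1ay == 0:
            continue  # cell carries no sex-marker information
        informative[(batch, sample_id)] += 1
        if xist > 0:
            xist_pos[(batch, sample_id)] += 1
    return {key: xist_pos[key] / n for key, n in informative.items()}


def spread_by_batch(fractions):
    """Per batch: max minus min of the per-sample XIST+ fractions."""
    per_batch = defaultdict(list)
    for (batch, _sample_id), frac in fractions.items():
        per_batch[batch].append(frac)
    return {batch: max(v) - min(v) for batch, v in per_batch.items()}


fractions = xist_fraction_by_sample(cells)
spread = spread_by_batch(fractions)

print(fractions)
print(spread)
```

If demultiplexing worked, per-sample fractions should sit near 0 (male donor) or near 1 (female donor), giving a large spread in any batch containing both sexes. Intermediate fractions with a small within-batch spread would be consistent with the scrambling hypothesis, though they would not prove it.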
We've gone through the typical channels to discuss this, and I think they are reviewing it. I just wanted to post this for the general community as a warning (and also to see if others have seen this as well).
The dataset is still obviously quite useful for training models, cell typing, etc., but I would caution people against deriving biological insights from the data, given that we do not know which cells were derived from which donors.
If anyone has alternative explanations, a workaround, or is unable to replicate our results, it would be great to hear from them!

Edit:
The ability to embed images of graphs doesn't seem to be working on this forum. I've attached a link to Imgur so that people can view the plots.
https://imgur.com/uADtkVY
https://imgur.com/v2WEUBA
https://imgur.com/fN3Fy5U
Created by Mark-Phillip Pebworth (MPPebworth)

Hi @MPPebworth, would you be able to reply with the name or synapse ID of the dataset in question? This will help ensure that this information is clearly associated with the dataset in question. Thanks!

Dear Mark-Phillip Pebworth,
Thanks for raising this and sharing your plots. Which dataset in particular are you referring to? We are aware of putative issues with at least one [dataset](https://arkportal.synapse.org/Explore/Datasets/DetailsPage?id=63912719) currently and are working with the contributors to figure out the pertinent next steps to address these issues. Also, the [ARK release page](https://news.arkportal.org/data-release/) lists any documented issues that we discover and the actions taken to remedy them; you can refer to it for additional information about issues with datasets. With regard to demuxing using subject genotype data, the subject genotype information for AMP RA/SLE is actually available [here](https://arkportal.synapse.org/Explore/Datasets/DetailsPage?id=63424468).
Once you confirm which dataset has the issues you identified, we can look further into that.
Feel free to follow up with additional questions or clarifications through our helpdesk as well. [https://sagebionetworks.jira.com/servicedesk/customer/portal/11](https://sagebionetworks.jira.com/servicedesk/customer/portal/11)
Regards,
Bishoy Kamel
Associate Director · Computational Immunology