Hello! Looking for feedback, independent validation, and potentially to warn others.

TLDR: Analysis of sex-specific gene expression indicates that male and female cells are mixed within each sample ID (at least in our hands).

When pseudobulking gene expression by sample ID, female-specific gene expression (XIST) and Y-chromosome gene expression (EIF1AY) co-occur and, in the raw counts, are directly correlated with each other. This is biologically impossible and can't be attributed to batch effects. Two separate people got the same results, which cross-validates this concern.

The methods section states that demuxlet was used to demultiplex samples within each batch. It's feasible that demuxlet failed and sample IDs were scrambled, leading to cells from biologically male and female samples being assigned to the same sample ID. This hypothesis is partially supported by the fact that each batch appears to be relatively homogeneous in the number of cells expressing XIST and EIF1AY. In other words, each batch had a different number of male and female subjects, so if demultiplexing failed, the resulting "sample IDs" within a batch would contain a relatively similar ratio of male and female cells, with that ratio depending on the batch. This appears to explain the data well, but it is just conjecture. I've attached plots to show what we are seeing.

Demuxlet would need to be re-run and evaluated with the donors' genetic backgrounds, which we don't have access to, so we aren't able to re-analyze this dataset with confidence at the moment.

We've gone through the typical channels to discuss this, and I think they are reviewing it. I just wanted to post this for the general community as a warning (and also to see if others have seen this as well). The dataset is still obviously quite useful for training models, cell typing, etc., but I would caution people against deriving biological insights from the data, given that we do not know which cells were derived from which donors. If anyone has alternative explanations or a workaround, or is unable to replicate our results, it would be great to hear from them!

![Each batch appears to be homogeneous in XIST/EIF1AY expression](https://imgur.com/uADtkVY)
![XIST and EIF1AY form a continuum, not discrete populations, which is unrelated to biological sex.](https://imgur.com/fN3Fy5U)
![Raw pseudobulk counts (by sample ID) for EIF1AY and XIST are directly correlated with each other and do not form discrete male/female clusters.](https://imgur.com/v2WEUBA)

Edit: The ability to embed images of graphs doesn't seem to be working on this forum. I've attached links to Imgur so that people can view the plots. https://imgur.com/uADtkVY https://imgur.com/v2WEUBA https://imgur.com/fN3Fy5U
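For anyone who wants to try replicating the check described above, here is a minimal sketch of the idea in plain Python. It is not the code used for the plots. The input layout (a list of `(sample_id, {gene: raw_count})` pairs), the function names, and the `min_fraction` threshold of 0.1 are all assumptions made for illustration. It sums raw XIST and EIF1AY counts per sample ID and flags any sample ID where both markers contribute a non-trivial share.

```python
from collections import defaultdict


def pseudobulk_sex_markers(cells, female_gene="XIST", male_gene="EIF1AY"):
    """Sum raw counts of two sex-specific marker genes per sample ID.

    `cells` is an iterable of (sample_id, {gene: raw_count}) pairs.
    This layout is hypothetical; adapt it to however you load the counts.
    """
    totals = defaultdict(lambda: {female_gene: 0, male_gene: 0})
    for sample_id, counts in cells:
        for gene in (female_gene, male_gene):
            totals[sample_id][gene] += counts.get(gene, 0)
    return dict(totals)


def flag_mixed_samples(totals, female_gene="XIST", male_gene="EIF1AY",
                       min_fraction=0.1):
    """Return sample IDs whose minority marker makes up at least
    `min_fraction` of the combined marker counts.

    The 0.1 default is an arbitrary illustrative threshold, not a
    validated cutoff; low-level signal can also come from ambient RNA.
    """
    mixed = []
    for sample_id, t in totals.items():
        combined = t[female_gene] + t[male_gene]
        if combined == 0:
            continue  # no marker signal at all, nothing to judge
        minority_share = min(t[female_gene], t[male_gene]) / combined
        if minority_share >= min_fraction:
            mixed.append(sample_id)
    return sorted(mixed)


# Toy data: S1 looks female, S2 looks male, S3 carries both markers.
toy_cells = [
    ("S1", {"XIST": 5}),
    ("S1", {"XIST": 3}),
    ("S2", {"EIF1AY": 4}),
    ("S2", {"EIF1AY": 2}),
    ("S3", {"XIST": 4}),
    ("S3", {"EIF1AY": 3}),
]

toy_totals = pseudobulk_sex_markers(toy_cells)
toy_mixed = flag_mixed_samples(toy_totals)
print(toy_mixed)
```

On the toy data only `S3` is flagged, because it is the only sample ID with substantial counts for both genes. With correctly demultiplexed data you would expect few or no sample IDs to be flagged.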

Created by Mark-Phillip Pebworth MPPebworth
Hi @MPPebworth, We are very happy to announce the fixed AMP RA/SLE PBMC CITE-seq data are now available on the ARK Portal. This data has been re-released as two new dataset products: AMP SLE Phase II PBMC CITE-seq and AMP RA Phase II PBMC CITE-seq. The previous version has been archived. All the details, including what's changed and links to the new datasets, are available in the release notes: https://news.arkportal.org/news-release/
Ah, I see. Thank you for your work! BAM files are typically an output from CellRanger and they also would not be impacted by the sample_id issue. This is a high quality dataset with a (hopefully) very fixable metadata issue. I'm looking forward to when the corrected version is available!
Hi @MPPebworth, Thank you for your follow up and for specifying to which dataset you were referring. We have temporarily removed access to the AMP RA/SLE PBMC CITEseq dataset to prevent erroneous research results. At this time we are working closely with the contributors to make a corrected version of the data available. Unfortunately there are no BAM files available that can be provided. BAM files were not included in the initial contribution nor are they available from the contributors at this time.
Sorry for the late reply here. It was the AMP RA/SLE PBMC CITEseq dataset (63912719), but it looks like the entire dataset was scrubbed, including the BAM files. Based on Bishoy's response, the files available under the genome section are exactly what we need to re-run demultiplexing on the BAM files and hopefully fix the cell ID vs. sample ID issue. Would it be possible to put at least the BAM files back up so the community has a chance to try to fix it and confirm? When reviewing the ARK portal, there are no files present at all for that study. https://www.synapse.org/Synapse:syn64495007
Hi @MPPebworth, would you be able to reply with the name or Synapse ID of the dataset in question? This will help ensure that this information is clearly associated with the right dataset. Thanks!
Dear Mark-Phillip Pebworth, Thanks for raising this and sharing your plots. Which dataset in particular are you referring to? We are currently aware of putative issues with at least one [dataset](https://arkportal.synapse.org/Explore/Datasets/DetailsPage?id=63912719) and are working with the contributors to figure out the pertinent next steps to address these issues. The [ARK release page](https://news.arkportal.org/data-release/) also lists any documented issues that we discover and the actions taken to remedy them; you can refer to it for additional information about issues with datasets. With regard to demuxing using subject genotype data, the subject genotype information for AMP RA/SLE is actually available [here](https://arkportal.synapse.org/Explore/Datasets/DetailsPage?id=63424468). Once you confirm which dataset has the issues you identified, we can look further into that. Feel free to follow up with additional questions or clarifications through our helpdesk as well: [https://sagebionetworks.jira.com/servicedesk/customer/portal/11](https://sagebionetworks.jira.com/servicedesk/customer/portal/11) Regards, Bishoy Kamel, Associate Director · Computational Immunology


Feedback appreciated: Sample IDs look scrambled? Biologically impossible gene expression