The barcode-individual pairing file does not include all barcodes in the count matrix. (ROSMAP, snRNAseq - DLPFC, Experiment 2, syn31512863)

Hi, I am trying to match the cells in the count matrix to individuals using the barcodes. However, I found that the ROSMAP_snRNAseq_demultiplexed_ID_mapping.csv (syn34572333) does not cover all the barcodes present in the count matrix for the corresponding batch. For example, in the 190403-B4-A batch (syn51121931), there are 22,180 barcodes in the count matrix, but 4,124 of these barcodes cannot be found in the 190403-B4-A batch in the ROSMAP_snRNAseq_demultiplexed_ID_mapping.csv. This is quite strange. Could you please help? Thank you very much.

Created by zhijiehan103012
You are correct. After loading all files, I could see all 437 individuals (slightly more, in fact). Thank you!
I believe the count matrices are grouped by library batch and contain several individuals in each matrix. The filenames look like the `libraryBatch` column in the barcode -> individual mapping file. When you load all the matrices and map all the column names back to individuals, are there still individuals missing?
Thank you, @jaclynbeck, for the response. I also noticed that the count matrix folder contains only 127 matrices, which does not seem to cover all 437 donors remaining in the post-analysis atlas. Do you have any advice on how to get the full dataset? Best regards, Thomas
Hello @taagbaedeng, you will have to combine a few different files to get this information. First, syn34572333 has the mapping between the column names in the count matrices (cellBarcode) to individuals (individualID). Note that this file only contains barcodes for cells that were confidently mapped to a single individual, so it may not contain all barcodes in the matrices. Next, syn3191087 has the clinical metadata for each individualID. That file has clinical diagnosis, pathology, and some cognitive measures. I'm not sure how much other information on co-morbidities was collected, but if you need variables that don't appear in the clinical metadata file on Synapse, you will need to request them directly from [Rush/RADC](https://www.radc.rush.edu/). Hopefully though, combining these two files will be enough information for you to work with! Let me know if you have any more questions.
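As a rough illustration of combining the two files described above, here is a minimal sketch in plain Python. It assumes both files have already been read into lists of dicts (for example with `csv.DictReader`). The `individualID` and `cellBarcode` column names come from the post; the `diagnosis` column, the toy IDs, and the toy barcodes are hypothetical placeholders, so check the real column names in syn3191087 before relying on this.

```python
from collections import Counter

# Toy rows standing in for the barcode -> individual mapping file (syn34572333).
mapping_rows = [
    {"cellBarcode": "AAAC-1", "individualID": "R0000001"},
    {"cellBarcode": "AAAG-1", "individualID": "R0000001"},
    {"cellBarcode": "AACT-1", "individualID": "R0000002"},
]

# Toy rows standing in for the clinical metadata file (syn3191087).
# "diagnosis" is a hypothetical column name used only for illustration.
clinical_rows = [
    {"individualID": "R0000001", "diagnosis": "placeholder_a"},
]

# Count how many mapped cells each individual contributes.
cells = Counter(row["individualID"] for row in mapping_rows)

# Index the clinical rows by individualID for constant-time lookup.
clinical_by_id = {row["individualID"]: row for row in clinical_rows}

# One summary row per individual; diagnosis is None when no clinical row exists.
summary = [
    {
        "individualID": ind,
        "n_cells": n,
        "diagnosis": clinical_by_id.get(ind, {}).get("diagnosis"),
    }
    for ind, n in sorted(cells.items())
]

# Individuals that have cells but no clinical metadata row.
missing_clinical = sorted(set(cells) - set(clinical_by_id))
```

The `missing_clinical` list is a quick check for individuals whose variables would have to be requested elsewhere.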
Hi @masashi, thank you very much for this rich resource. I am trying to identify the clinical diagnosis and comorbidities of the individuals in **Single Nucleus RNAseq - DLPFC, Experiment 2 (N = 465)**. The available metadata don't seem to have this information. Do you know how I could retrieve the diagnoses and/or clinical information for the count matrices (syn51123521)? Thank you very much!
Thank you, @jaclynbeck and @masashi. Your response was really helpful. Thank you very much
Hi @jaclynbeck and @alejandra_danae_cortes14, 200310-B18-A and 200310-B18-B are two distinct libraries prepared from different aliquots of the same nuclei suspension. Each of the A and B libraries was sequenced at both Broad and NYGC. CellRanger count for the A library was run on the combined FASTQs from Broad and NYGC. Therefore, the [count matrices](https://synapse.org/Synapse:syn51123521) already contain UMI counts from both Broad and NYGC. The B library was processed in the same way.
Sorry for the second post, my previous post did not tag people properly. @masashi, can you clarify @alejandra_danae_cortes14's question above about whether technical replicates were aggregated by CellRanger, or left separated?
@alejandra_danae_cortes14 I don't believe the technical replicates were aggregated, but I'm not 100% sure. Based on the [preprint](https://www.biorxiv.org/content/10.1101/2022.11.07.515446v1.full) paper, it sounds like CellRanger was run on each batch separately, which would imply no aggregation between replicates. Perhaps the author (@masashi) can clarify on this point?
@paulinapglz99 I think I figured out a way, but it's not super straightforward. All of the [count matrices](syn51123521) you linked are annotated with libraryBatch and sequencingBatch, so you can do:

1. Make a dataframe with the barcodes in each counts matrix, plus the sequencingBatch and libraryBatch of the file each barcode comes from,
2. Merge #1 with the pairing of barcodes/individualIDs from the demultiplexing file, on the `libraryBatch` and `cellBarcode` fields,
3. Merge the [biospecimen](https://synapse.org/Synapse:syn21323366) and [assay](https://synapse.org/Synapse:syn21073536) metadata files together on the `specimenID` field, and
4. Merge #2 and #3 together on the `individualID`, `libraryBatch`, and `sequencingBatch` fields.

That will let you associate the `platformLocation` with each specific barcode.

**Getting annotations** I'm not sure of your level of familiarity with using `synapseclient` (Python) or `synapser` (R) to get files, but when you download a file using these libraries, the annotations are already in the file object you get back. So you can fairly easily get the sequencingBatch and libraryBatch for each file that way. I believe going to the [count matrices](syn51123521) folder, clicking "Download Options" -> "Programmatic Options", and running the code in your preferred language will also return an object with all the annotations as a data frame, but I haven't tested that. Let me know if you need any help with this or if the merging I suggested doesn't do what you need! Jaclyn Beck
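The merge steps described in the post above can be sketched roughly as follows, using plain Python dict rows in place of data frames. This is only an illustration under several assumptions: the `inner_join` helper is hypothetical, the barcodes are made up, each count matrix file is assumed to carry a single `sequencingBatch` annotation, and the assay file is assumed to hold the `sequencingBatch`, `libraryBatch`, and `platformLocation` columns. The specimen, individual, and batch values are taken from the example earlier in this thread.

```python
def inner_join(left, right, keys):
    """Inner-join two lists of dict rows on the given key columns (hypothetical helper)."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            # Values from `left` win if both rows share a non-key column.
            joined.append({**match, **row})
    return joined


# Step 1: barcodes per matrix, with the batch annotations of the source file.
# The barcodes here are invented for illustration.
matrix_barcodes = [
    {"cellBarcode": "AAACCCAAGAAACACT-1", "libraryBatch": "201021-B60-B",
     "sequencingBatch": "HCVG2CCX2"},
    {"cellBarcode": "TTTGGTTTCACCGGTA-1", "libraryBatch": "201021-B60-B",
     "sequencingBatch": "HCVG2CCX2"},
]

# Step 2: attach individualID from the demultiplexing file.
demux = [
    {"cellBarcode": "AAACCCAAGAAACACT-1", "libraryBatch": "201021-B60-B",
     "individualID": "R5061712"},
]
step2 = inner_join(matrix_barcodes, demux, ["libraryBatch", "cellBarcode"])

# Step 3: merge biospecimen and assay metadata on specimenID.
biospecimen = [{"specimenID": "201021-B60-B_R5061712", "individualID": "R5061712"}]
assay = [
    {"specimenID": "201021-B60-B_R5061712", "sequencingBatch": "HCVG2CCX2",
     "libraryBatch": "201021-B60-B", "platformLocation": "Broad"},
    {"specimenID": "201021-B60-B_R5061712", "sequencingBatch": "HVN23DSXY",
     "libraryBatch": "201021-B60-B", "platformLocation": "NYGC"},
]
step3 = inner_join(assay, biospecimen, ["specimenID"])

# Step 4: merge the results of steps 2 and 3 to get platformLocation per barcode.
step4 = inner_join(step2, step3, ["individualID", "libraryBatch", "sequencingBatch"])
```

Because every step is an inner join, barcodes that are absent from the demultiplexing file drop out at step 2. The second toy barcode shows this.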
Hello :) , I have a specific question regarding the generation of the Count Matrices provided on Synapse. I understand that for some specimenIDs, there were technical replicates sequenced at different centers (Broad and NYGC). According to what I have read in forums and regarding Cell Ranger, these technical replicates should be merged during the cellranger count process to correctly handle UMI deduplication. My question is: Has this merging of technical replicates (e.g., combining Broad and NYGC runs) already been performed to generate the current Count Matrices? Specifically regarding the files with "A" and "B" suffixes (e.g., 190403-B4-A and 190403-B4-B): Does the existence of these separate files imply that "A" and "B" are distinct biological samples where the technical replicates have already been aggregated? Or do "A" and "B" themselves represent the technical replicates that still need to be merged? I want to ensure I am not incorrectly treating technical replicates as separate biological samples in my downstream analysis. Thanks for the clarification!
Hello! We are exploring the data and have noticed that there are technical duplicates, meaning that for some specimenIDs there are two sequencing batches, according to ROSMAP_assay_scrnaSeq_metadata.csv (syn21073536). We downloaded the count matrix data (syn51123521) and demultiplexed it with the information from ROSMAP_snRNAseq_demultiplexed_ID_mapping.csv (syn34572333). We read that in ROSMAP_biospecimen_metadata.csv (syn21323366) we could find where each count matrix comes from, but the column specimenIdSource is empty. Do you know how we can differentiate where the specimenIDs come from in terms of sequencing? Example:

| | specimenID | individualID | sequencingBatch | libraryBatch | platformLocation |
|---|---|---|---|---|---|
| 1 | 201021-B60-B_R5061712 | R5061712 | HCVG2CCX2 | 201021-B60-B | Broad |
| 2 | 201021-B60-B_R5061712 | R5061712 | HVN23DSXY | 201021-B60-B | NYGC |

Thanks in advance!
That's a good question. If you're looking for actual methods information, you will have to ask @masashi for that information. I did find where the sequencing center information is: [syn21073536](https://www.synapse.org/Synapse:syn21073536) contains scRNA-seq assay metadata, and has a `platformLocation` field which should say which sequencing center each sample was processed at. To map `specimenID` in this file to the `individualID` in the barcode mapping file, you will need [syn21323366](https://www.synapse.org/Synapse:syn21323366), which has both columns. I hope that helps!
Thank you, @jaclynbeck. I have another question. How do you handle technical duplicates from different sequencing centers (Broad and NYGC) when quantifying with CellRanger? I noticed that while batch information is included, the sequencing center (Broad vs. NYGC) is not specified in the counts matrix.
Hello, If I recall, the ID mapping file only contains barcodes for cells that could be confidently mapped to a single individual. Cells that were ambiguous, didn’t map, or failed some QC, might exist in the counts matrix but won’t exist in the ID file, so the extra cells can probably be discarded from the matrix. You may want to double-check with the contributor ( @masashi ) just in case though. I hope that helps! Jaclyn Beck
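A minimal sketch of that filtering step, assuming the matrix column names and the mapping file rows have already been loaded and that barcode strings match exactly between the two (worth verifying on the real files). The `split_barcodes` function, the barcodes, and the individual IDs are hypothetical; the `libraryBatch`, `cellBarcode`, and `individualID` column names are the ones mentioned in this thread.

```python
def split_barcodes(matrix_barcodes, mapping_rows, library_batch):
    """Split a matrix's barcodes into assigned (barcode, individualID) pairs
    and unassigned barcodes, using only mapping rows for the given batch."""
    assigned = {
        row["cellBarcode"]: row["individualID"]
        for row in mapping_rows
        if row["libraryBatch"] == library_batch
    }
    kept = [(bc, assigned[bc]) for bc in matrix_barcodes if bc in assigned]
    dropped = [bc for bc in matrix_barcodes if bc not in assigned]
    return kept, dropped


# Toy stand-in for rows of the ID mapping file (syn34572333).
mapping_rows = [
    {"libraryBatch": "190403-B4-A", "cellBarcode": "AAAC-1", "individualID": "R0000001"},
    {"libraryBatch": "190403-B4-A", "cellBarcode": "AAAG-1", "individualID": "R0000002"},
    # Same barcode string in another batch must not count for 190403-B4-A.
    {"libraryBatch": "190403-B4-B", "cellBarcode": "AACT-1", "individualID": "R0000003"},
]

# Toy stand-in for the column names of the 190403-B4-A count matrix.
matrix_barcodes = ["AAAC-1", "AAAG-1", "AACT-1", "AAGG-1"]

kept, dropped = split_barcodes(matrix_barcodes, mapping_rows, "190403-B4-A")
```

Restricting the lookup to one `libraryBatch` matters because the same barcode sequence can appear in more than one library.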
