Hi, I am trying to match the cells in the count matrix to individuals using the barcodes. However, I found that the ROSMAP_snRNAseq_demultiplexed_ID_mapping.csv (syn34572333) does not cover all the barcodes present in the count matrix for the corresponding batch. For example, in the 190403-B4-A batch (syn51121931), there are 22,180 barcodes in the count matrix, but 4,124 of these barcodes cannot be found in the 190403-B4-A batch in the ROSMAP_snRNAseq_demultiplexed_ID_mapping.csv. This is quite strange. Could you please help? Thank you very much.
Created by zhijiehan103012

Thank you, @jaclynbeck and @masashi. Your response was really helpful. Thank you very much.

Hi @jaclynbeck and @alejandra_danae_cortes14
200310-B18-A and 200310-B18-B are two distinct libraries prepared from different aliquots of the same nuclei suspension. Each of the A and B libraries was sequenced at Broad and NYGC. CellRanger count of the A library was performed by combining the FASTQs from Broad and NYGC. Therefore, the [count matrices](https://synapse.org/Synapse:syn51123521) already contain UMI counts from both Broad and NYGC. The B library was processed in the same way.

Sorry for the second post, my previous post did not tag people properly.
@masashi, can you clarify @alejandra_danae_cortes14's question above about whether technical replicates were aggregated by CellRanger, or left separated?

@alejandra_danae_cortes14 I don't believe the technical replicates were aggregated, but I'm not 100% sure. Based on the [preprint](https://www.biorxiv.org/content/10.1101/2022.11.07.515446v1.full), it sounds like CellRanger was run on each batch separately, which would imply no aggregation between replicates.
Perhaps the author (@masashi) can clarify this point?

@paulinapglz99 I think I figured out a way but it's not super straightforward.
All of the [count matrices](https://synapse.org/Synapse:syn51123521) you linked are annotated with `libraryBatch` and `sequencingBatch`, so you can do:
1. Make a dataframe with the barcodes in each counts matrix, plus the sequencingBatch and libraryBatch on the file each barcode comes from,
2. Merge #1 with the pairing of barcodes/individualIDs from the demultiplexing file, on the `libraryBatch` and `cellBarcode` fields,
3. Merge the [biospecimen](https://synapse.org/Synapse:syn21323366) and [assay](https://synapse.org/Synapse:syn21073536) metadata files together on the `specimenID` field, and
4. Merge #2 and #3 together on the `individualID`, `libraryBatch`, and `sequencingBatch` fields.
That will let you associate the `platformLocation` with each specific barcode.
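The four merges above can be sketched in pandas roughly as follows. This is only a minimal illustration on toy data, not a tested pipeline: the barcodes are made up, and which columns live in which file (for example, that the assay metadata carries `libraryBatch`, `sequencingBatch`, and `platformLocation`) is an assumption based on the example rows in this thread, so check the real column names before relying on it.

```python
import pandas as pd

# Step 1: one row per barcode, tagged with the annotations of the count-matrix
# file it came from. Barcodes here are invented for illustration.
barcodes = pd.DataFrame({
    "cellBarcode": ["AAACCCAAGAAACACT-1", "AAACCCAAGAAACCAT-1", "AAACCCAAGAAACCCA-1"],
    "libraryBatch": ["201021-B60-B"] * 3,
    "sequencingBatch": ["HCVG2CCX2"] * 3,
})

# Toy stand-in for the demultiplexing file (barcode -> individualID).
demux = pd.DataFrame({
    "cellBarcode": ["AAACCCAAGAAACACT-1", "AAACCCAAGAAACCAT-1"],
    "libraryBatch": ["201021-B60-B"] * 2,
    "individualID": ["R5061712"] * 2,
})

# Toy stand-ins for the biospecimen and assay metadata files.
biospecimen = pd.DataFrame({
    "specimenID": ["201021-B60-B_R5061712"],
    "individualID": ["R5061712"],
})
assay = pd.DataFrame({
    "specimenID": ["201021-B60-B_R5061712"] * 2,
    "sequencingBatch": ["HCVG2CCX2", "HVN23DSXY"],
    "libraryBatch": ["201021-B60-B"] * 2,
    "platformLocation": ["Broad", "NYGC"],
})

# Step 2: attach individualID to each barcode; barcodes missing from the
# demultiplexing file are dropped by the inner join.
cells = barcodes.merge(demux, on=["libraryBatch", "cellBarcode"], how="inner")

# Step 3: combine biospecimen and assay metadata on specimenID.
specimens = biospecimen.merge(assay, on="specimenID", how="inner")

# Step 4: merge the results of steps 2 and 3 so every barcode gets a platformLocation.
annotated = cells.merge(
    specimens,
    on=["individualID", "libraryBatch", "sequencingBatch"],
    how="left",
)
print(annotated[["cellBarcode", "individualID", "sequencingBatch", "platformLocation"]])
```

Using `how="left"` in the last step keeps barcodes whose metadata did not match, so missing `platformLocation` values show up as NaN instead of silently disappearing.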
**Getting annotations**
I'm not sure of your level of familiarity with using `synapseclient` (Python) or `synapser` (R) to get files, but when you download a file using these libraries, the annotations are already in the file object you get back. So you can fairly easily get the sequencing/libraryBatch for each file that way.
I believe going to the [count matrices](https://synapse.org/Synapse:syn51123521) folder, clicking "Download Options" -> "Programmatic Options" and running the code in your preferred language will also return an object with all the annotations as a data frame, but I haven't tested that.
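For the Python route, a rough sketch of collecting the two batch annotations per file could look like this. It assumes the files sit directly inside the folder (nested subfolders would need extra handling) and that the annotation keys are spelled `libraryBatch` and `sequencingBatch`; the helper names `first_value` and `batch_annotations` are made up for this example. It relies on `Synapse.getChildren` and `Synapse.get_annotations` from `synapseclient`, which may behave differently or be deprecated depending on your client version.

```python
import pandas as pd


def first_value(value):
    """Annotation values typically come back as lists; unwrap the first item."""
    if isinstance(value, (list, tuple)):
        return value[0] if value else None
    return value


def batch_annotations(syn, folder_id):
    """Build one row per file in the folder with its batch annotations.

    `syn` is expected to be a logged-in synapseclient.Synapse object.
    """
    rows = []
    for child in syn.getChildren(folder_id, includeTypes=["file"]):
        annotations = syn.get_annotations(child["id"])
        rows.append({
            "fileId": child["id"],
            "fileName": child["name"],
            "libraryBatch": first_value(annotations.get("libraryBatch")),
            "sequencingBatch": first_value(annotations.get("sequencingBatch")),
        })
    return pd.DataFrame(rows)


# Usage against Synapse (needs an account with access to the data):
# import synapseclient
# syn = synapseclient.Synapse()
# syn.login()
# batches = batch_annotations(syn, "syn51123521")
```

This only reads annotations and does not download the count matrices themselves.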
Let me know if you need any help with this or if the merging I suggested doesn't do what you need!
Jaclyn Beck
Hello :) ,
I have a specific question regarding the generation of the Count Matrices provided on Synapse.
I understand that for some specimenIDs, there were technical replicates sequenced at different centers (Broad and NYGC). According to what I have read in forums and regarding Cell Ranger, these technical replicates should be merged during the cellranger count process to correctly handle UMI deduplication.
My question is: Has this merging of technical replicates (e.g., combining Broad and NYGC runs) already been performed to generate the current Count Matrices?
Specifically regarding the files with "A" and "B" suffixes (e.g., 190403-B4-A and 190403-B4-B): Does the existence of these separate files imply that "A" and "B" are distinct biological samples where the technical replicates have already been aggregated? Or do "A" and "B" themselves represent the technical replicates that still need to be merged?
I want to ensure I am not incorrectly treating technical replicates as separate biological samples in my downstream analysis.
Thanks for the clarification!
Hello!
We are exploring the data and have noticed that there are technical duplicates, meaning that for some specimenIDs there are two sequencing batches, according to ROSMAP_assay_scrnaSeq_metadata.csv (syn21073536). We downloaded the count matrix data (syn51123521) and demultiplexed it with the information from ROSMAP_snRNAseq_demultiplexed_ID_mapping.csv (syn34572333).
We read that in ROSMAP_biospecimen_metadata.csv (syn21323366) we could find where each count matrix comes from, but the column specimenIdSource is empty.
Do you know how we can differentiate where the specimenIDs come from in terms of sequencing?
Example:
| | specimenID | individualID | sequencingBatch | libraryBatch | platformLocation |
|---|---|---|---|---|---|
| 1 | 201021-B60-B_R5061712 | R5061712 | HCVG2CCX2 | 201021-B60-B | Broad |
| 2 | 201021-B60-B_R5061712 | R5061712 | HVN23DSXY | 201021-B60-B | NYGC |
Thanks in advance!

That's a good question. If you're looking for actual methods information, you will have to ask @masashi.
I did find where the sequencing center information is:
[syn21073536](https://www.synapse.org/Synapse:syn21073536) contains scRNA-seq assay metadata, and has a `platformLocation` field which should say which sequencing center each sample was processed at.
To map `specimenID` in this file to the `individualID` in the barcode mapping file, you will need [syn21323366](https://www.synapse.org/Synapse:syn21323366), which has both columns.
I hope that helps!

Thank you, @jaclynbeck. I have another question. How do you handle technical duplicates from different sequencing centers (Broad and NYGC) when quantifying with CellRanger? I noticed that while batch information is included, the sequencing center (Broad vs. NYGC) is not specified in the counts matrix.
Hello,
If I recall, the ID mapping file only contains barcodes for cells that could be confidently mapped to a single individual. Cells that were ambiguous, didn’t map, or failed some QC, might exist in the counts matrix but won’t exist in the ID file, so the extra cells can probably be discarded from the matrix. You may want to double-check with the contributor ( @masashi ) just in case though.
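To see how many barcodes in a batch lack an assignment before discarding them, a left merge with pandas' `indicator=True` is one option. The sketch below uses invented barcodes and a placeholder individualID; only the `cellBarcode`, `libraryBatch`, and `individualID` column names are taken from this thread, and they are assumed to match the real files.

```python
import pandas as pd

# Barcodes read from one count matrix (invented values for illustration).
matrix_barcodes = pd.DataFrame({
    "cellBarcode": ["AAAC-1", "AAAG-1", "AAAT-1", "AACC-1"],
    "libraryBatch": ["190403-B4-A"] * 4,
})

# Toy stand-in for the ID mapping file; "R0000001" is a placeholder ID.
mapping = pd.DataFrame({
    "cellBarcode": ["AAAC-1", "AAAG-1", "AAAT-1"],
    "libraryBatch": ["190403-B4-A"] * 3,
    "individualID": ["R0000001"] * 3,
})

# The _merge column records whether each barcode was found in the mapping.
check = matrix_barcodes.merge(
    mapping, on=["libraryBatch", "cellBarcode"], how="left", indicator=True
)

unassigned = check[check["_merge"] == "left_only"]
assigned = check[check["_merge"] == "both"].drop(columns="_merge")

print(f"{len(unassigned)} of {len(check)} barcodes have no individual assigned")
```

The `assigned` frame can then be used to subset the count matrix to the cells that were mapped to an individual.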
I hope that helps!
Jaclyn Beck
The barcode-individual pairing file does not include all barcodes in the count matrix. (ROSMAP, snRNAseq - DLPFC, Experiment 2, syn31512863)