Hi, I am trying to link the scRNA cells from `analysis_scRNAseq_tumor_counts.h5ad` to the metadata file `analysis_scRNAseq_tumor_metadata.tsv` to obtain the linked subjects, conform your guidelines: files: analysis_scRNAseq_tumor_counts.h5ad This file contains the post-filtered count matrices (pre-normalization) for the tumor samples (microenvironmental non-tumor + tumor cells) across the 11 subjects in this study. The sample and cell state annotation can be linked from this file to the cell_barcode variable in the table "analysis_scRNAseq_tumor_metadata.tsv". However, the cell names in the hda5 file do not match with the metadata and processed counts table: ``` > SeuratDisk::Convert("data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_counts.h5ad", dest = "cache/analysis_scRNAseq_tumor_counts.h5", overwrite = TRUE) > d = SeuratDisk::LoadH5Seurat("cache/analysis_scRNAseq_tumor_counts.h5") > colnames(d)[c(1:5,55248)] "Cell1" "Cell2" "Cell3" "Cell4" "Cell5" "Cell55248" ``` ``` > m = read.csv("data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_metadata.tsv",sep="\t") > m$cell_barcode[c(1:5,55248)] [1] "AAACCTGAGGACCACA-1-0" "AAACCTGCAAATACAG-1-0" "AAACCTGTCACTGGGC-1-0" "AAACGGGGTGAGTATA-1-0" "AAACGGGGTGCAACGA-1-0" "TTTACTGCACGACAGA-1-10" ``` @elang @kcjohnson Do you have suggestions how to link the subjects to the cells from `analysis_scRNAseq_tumor_counts.h5ad` ? Youri

Created by Youri Hoogstrate yhoogstrate
Hi @kcjohnson thank you so much for your prompt response and verification of the samples! Yes i'm trying to get only the IDH-m gliomas of astrocytic origin of any grade and am combining your data with some other public datasets so I really appreciate your clarification so that I can be confident that I have only those samples. Thanks again
Hi @dmiyagi , happy to provide clarification! It sounds like you are interested in analyzing IDHmut astrocytomas as defined by the 2021 WHO classification. If so, I would look at the "idh_mut_subtype" column in either the `clinical_metadata` table (Synapse) or STable1 for the 4 IDHmut-noncodels (SM002, SM008, SM015, and SM019). I am confident in these classifications given that we have WGS (driver mutations, CNVs) and methylation classification support. As for Extended Data Fig 6C, we've analyzed a number of sc/snRNA profiles of adult glioma tumors and have found that microenvironmental composition is indeed distinct by glioma subtype but it is also strongly influenced by both time point (initial vs recurrence), extent of surgical resection + spatial location of tissue sample, and a few other features. Thus, I would take caution with interpreting the distribution of normal cell types within this smaller dataset. -Kevin
@kcjohnson I'm sorry to be a major pain but I was finishing up the analysis and cross referencing with your paper. If I'm not mistaken, your STable1 has the corresponding ID to histological_classifications for the IDH astrocytomas: case_barcode histological_classification SM002 Anaplastic Astrocytoma SM008 Anaplastic astrocytoma SM015 Astrocytoma SM019 Diffuse Astrocytoma But I believe that the ID mapping file you gave had: SM019 : RV18001 SM008 : RV19005 In your Ex. Data 6C I see that SM019 and SM008 seem to have the highest % oligodentrocytes. I'm wondering if maybe the ID mapping might be incorrect or I might be misunderstanding one of the pieces of data? If i'm trying to only get IDH-m astrocytomas, should I include SM019 and SM008? or only SM004, SM001, SM015 and SM002 which is what it looks like in Ex. Data 6C ? Thank you so much!
Thank you @kcjohnson !! PS. welcome to Yale, I'm in the Gunel Lab
Hi @dmiyagi, Sorry for the delayed response - we're in the process of transitioning our lab to a new institution (JAX > Yale). Thank you for pointing out helpfulness of a direct link between the `case_barcode` present in the metadata and `sampleid` in the h5ad file. I have uploaded that file as "analysis_scRNA_id_mapping". Note that another way to link these features is to join based on the `cell_barcode` (analysis_scRNAseq_tumor_metadata.csv) and `index` (scanpy: analysis_scRNAseq_tumor_counts.h5ad) fields. Also, thanks to Youri ( @yhoogstrate ) for posting that helpful solution to converting for use in Seurat/R! -Kevin
Hi @yhoogstrate thank you so much! Although if i'm not mistaken, I believe your code is for R/Seurat? I am using Scanpy/Python. Of course I can go through and manually convert, but prefer to not make a mistake if @kcjohnson has the solution for Python because I tried his recommendation of adata.obs_names but I am seeing that the barcodes are as such: ``` Index(['AAACCTGAGGACCACA-1-0', 'AAACCTGCAAATACAG-1-0', 'AAACCTGTCACTGGGC-1-0', 'AAACGGGGTGAGTATA-1-0', 'AAACGGGGTGCAACGA-1-0', 'AAACGGGGTTGCTCCT-1-0', 'AAACGGGTCACAATGC-1-0', 'AAACGGGTCAGTTGAC-1-0', 'AAACGGGTCCCGGATG-1-0', 'AAAGATGGTTTGTTGG-1-0', ... 'TTTCGATTCCGTAATG-1-10', 'TTTGACTGTTCATCGA-1-10', 'TTTGATCCATACAGGG-1-10', 'TTTGATCTCCATTGGA-1-10', 'TTTGATCTCCTGCTAC-1-10', 'TTTGGAGCATCCTAAG-1-10', 'TTTGGAGGTACCTGTA-1-10', 'TTTGGTTCATACAGCT-1-10', 'TTTGGTTTCCGATCTC-1-10', 'TTTGGTTTCGAGTCTA-1-10'], dtype='object', name='index', length=55284) ``` can I get the conversion to "RV18001" to "RV19009" in the clinical data?
@dmiyagi - My code snipped does exactly that. Using the code below, all individual samples will be exported to separate folders, including the IDH-mut non-codels: ``` if(!file.exists("cache/analysis_scRNAseq_tumor_counts.h5")) { SeuratDisk::Convert("data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_counts.h5ad", dest = "cache/analysis_scRNAseq_tumor_counts.h5", overwrite = TRUE) } seurat_obj <- SeuratDisk::LoadH5Seurat("cache/analysis_scRNAseq_tumor_counts.h5") rename <- read.table('cache/syn25956426_Johnson_cell_identifiers.txt', header=T) |> dplyr::rename(new.name = umi) |> dplyr::mutate(old.name = paste0("Cell",1:dplyr::n())) seurat_obj <- Seurat::RenameCells(seurat_obj, new.names = rename$new.name) metadata <- rename |> dplyr::left_join(read.csv("data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_metadata.tsv", sep="\t"), by=c('new.name'='cell_barcode')) |> dplyr::left_join( read.csv('data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_syn25880693_clinical_metadata.csv'), by=c('case_barcode'='case_barcode') ) stopifnot(colnames(seurat_obj) == metadata$new.name) for(slot in c("cell_state","case_barcode","case_sex","tumor_location","laterality","driver_mutations","time_point","idh_codel_subtype","who_grade","histological_classification")) { seurat_obj[[slot]] <- metadata[[slot]] } # export individual samples ---- # split per individual and export for(slot in read.csv('data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_syn25880693_clinical_metadata.csv')$case_barcode) { #slot = "SM001" export <- subset(seurat_obj, case_barcode == slot) export.slim <- Seurat::DietSeurat(export, counts = TRUE, data = TRUE, scale.data = FALSE, features = NULL, assays=c('RNA'), dimreducs = NULL, graphs = NULL, misc = F ) export.slim@misc <- export.slim@meta.data[c('case_barcode', 'case_sex', 'tumor_location', 'laterality', 'driver_mutations', 'time_point', 'idh_codel_subtype', 'who_grade')] |> tibble::remove_rownames() |> dplyr::distinct() export.slim@meta.data <- export.slim@meta.data[c('cell_state')] DropletUtils::write10xCounts(path=paste0("cache/syn25956426_Johnson__split__",slot), x=export.slim@assays$RNA@counts) write.csv(data.frame( cell_barcode = colnames(export.slim), cell_state = export.slim@meta.data$cell_state ), file=paste0("cache/syn25956426_Johnson__split__",slot,"/cell_states.csv")) rm(export, export.slim) } ``` Resulting in: ``` $ tree syn* syn25956426_Johnson__split__SM001 ├── barcodes.tsv ├── cell_states.csv ├── genes.tsv └── matrix.mtx syn25956426_Johnson__split__SM002 ├── barcodes.tsv ├── cell_states.csv ├── genes.tsv └── matrix.mtx syn25956426_Johnson__split__SM004 ├── barcodes.tsv ├── cell_states.csv ├── genes.tsv └── matrix.mtx syn25956426_Johnson__split__SM006 ├── barcodes.tsv ├── cell_states.csv ├── genes.tsv └── matrix.mtx syn25956426_Johnson__split__SM008 ├── barcodes.tsv ├── cell_states.csv ├── genes.tsv └── matrix.mtx syn25956426_Johnson__split__SM011 ├── barcodes.tsv ├── cell_states.csv ├── genes.tsv └── matrix.mtx syn25956426_Johnson__split__SM012 ├── barcodes.tsv ├── cell_states.csv ├── genes.tsv └── matrix.mtx syn25956426_Johnson__split__SM015 ├── barcodes.tsv ├── cell_states.csv ├── genes.tsv └── matrix.mtx syn25956426_Johnson__split__SM017 ├── barcodes.tsv ├── cell_states.csv ├── genes.tsv └── matrix.mtx syn25956426_Johnson__split__SM018 ├── barcodes.tsv ├── cell_states.csv ├── genes.tsv └── matrix.mtx syn25956426_Johnson__split__SM019 ├── barcodes.tsv ├── cell_states.csv ├── genes.tsv └── matrix.mtx ```
Thanks @yhoogstrate ! I realize now I meant to tag @kcjohnson . @kcjohnson when I use scanpy to look at barcodes when using adata.obs_names, I see barcodes that don't seem to match. For example I see barcodes: TTTGGAGCATCCTAAG-1-10 , or AAACCTGAGGACCACA-1-0, but clinical metadata the cases are named like "SM001" or "SM019" whereas the adata object has "RV18001" to "RV19009" can you tell me how to convert these? I'm trying to only extract the IDH mutant so it's important that I get the right samples. Thanks so much!
@dmiyagi ``` if(!file.exists("cache/analysis_scRNAseq_tumor_counts.h5")) { SeuratDisk::Convert("data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_counts.h5ad", dest = "cache/analysis_scRNAseq_tumor_counts.h5", overwrite = TRUE) } seurat_obj <- SeuratDisk::LoadH5Seurat("cache/analysis_scRNAseq_tumor_counts.h5") rename <- read.table('cache/syn25956426_Johnson_cell_identifiers.txt', header=T) |> dplyr::rename(new.name = umi) |> dplyr::mutate(old.name = paste0("Cell",1:dplyr::n())) seurat_obj <- Seurat::RenameCells(seurat_obj, new.names = rename$new.name) # load per cell metadata and per patient metadata (latter from Table 'clinical metadata', not file) metadata <- rename |> dplyr::left_join(read.csv("data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_metadata.tsv", sep="\t"), by=c('new.name'='cell_barcode')) |> dplyr::left_join( read.csv('data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_syn25880693_clinical_metadata.csv'), by=c('case_barcode'='case_barcode') ) stopifnot(colnames(seurat_obj) == metadata$new.name) for(slot in c("cell_state","case_barcode","case_sex","tumor_location","laterality","driver_mutations","time_point","idh_codel_subtype","who_grade","histological_classification")) { seurat_obj[[slot]] <- metadata[[slot]] } ```
Hi Youri (@yhoogstrate), I wanted to piggy-back off this question. I am using scanpy and AnnData but I noticed that the "case_barcodes" column and the adata.obs["sampleid"] don't match. i.e. in the clinical metadata the cases are named like "SM001" or "SM019" whereas the adata object has "RV18001" to "RV19009" can you tell me how to convert these? or whether I should be using a different metadata column?
Dear Kevin, Thank you!! I wasn't aware these files were generated using python. I will try python / scanpy immediately. Thanks! Youri edit 1 --- I've exported the identifiers [on the assumption that the order matches with the R loader, will check this] : ``` from tqdm import tqdm import scanpy as sc path = "data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_counts.h5ad" h5 = sc.read_h5ad(path) with open("cache/syn25956426_Johnson_cell_identifiers.txt","w") as fh: fh.write("umi\n") for _ in tqdm(h5.obs_names): fh.write(str(_)+"\n") ``` edit 2 --- Identifiers match with the order according to UMAP. If someone else wants to load this file into R: ``` if(!file.exists("cache/analysis_scRNAseq_tumor_counts.h5")) { SeuratDisk::Convert("data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_counts.h5ad", dest = "cache/analysis_scRNAseq_tumor_counts.h5", overwrite = TRUE) } seurat_obj <- SeuratDisk::LoadH5Seurat("cache/analysis_scRNAseq_tumor_counts.h5") rename <- read.table('cache/syn25956426_Johnson_cell_identifiers.txt', header=T) |> # this is the file exported with the python script from *edit 1* dplyr::rename(new.name = umi) |> dplyr::mutate(old.name = paste0("Cell",1:dplyr::n())) seurat_obj <- Seurat::RenameCells(seurat_obj, new.names = rename$new.name) metadata <- rename |> dplyr::left_join(read.csv("data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_metadata.tsv", sep="\t"), by=c('new.name'='cell_barcode')) stopifnot(colnames(seurat_obj) == metadata$new.name) seurat_obj@meta.data$cell_state = metadata$cell_state seurat_obj@meta.data$case_barcode = metadata$case_barcode ```
Hi Youri, If I correctly understand your objective it's that you are looking to load the h5ad file generated via scanpy into R. I ran your commands and it looks like there was a warning message printed that indicates the issue: `Warning: No cell-level metadata present, creating fake cell names`. I assume that this occurred for you too given the generic cell names "Cell1" etc. I am not sure why that's happening as the other metadata is present. I just loaded the h5ad file into scanpy and can see the expected barcodes when using adata.obs_names. I tried looking for alternative solutions to easily load the h5ad file into R, but did not have any luck. If you are able to work with the data in scanpy, I think that's the easiest solution. Best, Kevin

.sg-noscript { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif; max-width: 860px; margin: 40px auto; padding: 0 24px; color: #141414; line-height: 1.6; } .sg-noscript h1 { font-size: 1.8rem; margin-bottom: 0.25rem; } .sg-noscript h2 { font-size: 1.2rem; margin-top: 2rem; margin-bottom: 0.5rem; border-bottom: 1px solid #e0e0e0; padding-bottom: 0.25rem; } .sg-noscript ul { padding-left: 1.5rem; } .sg-noscript li { margin-bottom: 0.4rem; } .sg-noscript a { color: #1a6fa8; } .sg-noscript address { font-style: normal; } .sg-noscript .note { margin-top: 2rem; color: #666; font-size: 0.85rem; }

Synapse — A Collaborative Platform for Open Biomedical Science

Synapse is a collaborative data-sharing and analysis platform built and operated by Sage Bionetworks, a 501(c)(3) nonprofit biomedical research organization based in Seattle, Washington.

About Sage Bionetworks

Sage Bionetworks is a nonprofit research organization whose mission is to drive a new age of discovery through truly open science and radical collaboration.

Our vision is to create a world where silos within and across science and technology no longer exist, forging a path to optimal human health.

We are a trusted leader in data sharing and reuse, enabling a rapid acceleration in biomedical discoveries and the transformation of medicine. Better Science Together is the principle that guides our work with researchers, clinicians, patient communities, and funders worldwide.

What Synapse Does

Synapse is the platform Sage Bionetworks uses to make biomedical research data findable, accessible, interoperable, and reusable (FAIR). Researchers, clinicians, and data scientists use Synapse to:

Share large biomedical datasets across institutions, with appropriate access controls, data-use agreements, and governance.
Run reproducible analyses on shared data with documented provenance.
Coordinate consortium science across disease areas including Alzheimer's disease, neurofibromatosis, ALS, rare cancers, and others.
Power public-facing knowledge portals such as the AD Knowledge Portal, the NF Data Portal, and the ALS Knowledge Portal.

Nonprofit Identity

Sage Bionetworks
A 501(c)(3) nonprofit research organization
EIN: 26-4489946
Seattle, Washington, USA
sagebionetworks.org
Trust Center — Terms of Service, Privacy Policy, financial statements, and governance documents

Learn More

This static content is provided for search engines and users with JavaScript disabled. For the full Synapse experience, please enable JavaScript in your browser.

Drop files to upload

Linking scRNA cells from hda5 file to the metadata file. page is loading…