Clinical Project IDs not all 8 digits

Hi, I could be incorrect, but the RosMap Clinical data (syn3191087) appears to use 8-digit identifiers for individuals. However, some individuals have an ID that starts with a zero, and in the clinical data file this leading zero is missing or has been trimmed. That prevents a programmatic match when integrating with biospecimen or assay metadata. Please see the code below, which is relevant for the SRM proteomics:

```
# SRM Proteomics - SNCA ~ rs2245801-C
##install.packages("synapser", repos = c("http://ran.synapse.org", "http://cran.fhcrc.org"))

# Load SRM experimental data
srm <- read.table( synapser::synGet( 'syn21137026' )$path, header = T, sep = '\t' )
srm <- srm[,c( 'Replicate.Name', 'plate', 'plate_row', 'plate_col', 'plate_well', 'subject.id', 'isControl', 'sample.type', 'SNCA') ]

srm_dat <- read.table( synapser::synGet( 'syn10468858' )$path, header = T, sep = '\t', row.names = 1 )

# RosMap SRM Assay Data
assay <- read.csv( synapser::synGet( 'syn23569441' )$path, header = T#, row.names = 1
)

# RosMap Clinical Data
clin <- read.csv( synapser::synGet( 'syn3191087' )$path, header = T#, row.names = 1
)

# RosMap Biospecimen Data
biospec <- read.csv( synapser::synGet( 'syn21323366' )$path, header = T#, row.names = 1
)

# Select gene of interest
srm$Peptide.Sequence <- srm_dat[ 'BIN1_3', ]$Peptide.Sequence
srm$Peptide.Note <- srm_dat[ 'BIN1_3', ]$Peptide.Note
srm$Peptide.Modified.Sequence <- srm_dat[ 'BIN1_3', ]$Peptide.Modified.Sequence
srm$sd.tech <- srm_dat[ 'BIN1_3', ]$sd.tech
srm$sd.full <- srm_dat[ 'BIN1_3', ]$sd.full
srm$med_area_light <- srm_dat[ 'BIN1_3', ]$med_area_light
srm$specie <- srm_dat[ 'BIN1_3', ]$specie

# Re-format column names
colnames(srm) <- gsub( '[.]', '_', colnames(srm) )

# Select Assay
biospec <- biospec[ biospec$assay %in% 'label free mass spectrometry', ]
outlier <- biospec[ !is.na( biospec$exclude ), ]$specimenID
biospec <- biospec[ !( biospec$specimenID %in% 'control' ), ]

# Remove Outlier and controls
srm <- srm[ !(srm$subject_id %in% 'control'), ]
srm <- srm[ !(srm$subject_id %in% outlier),]
biospec <- biospec[ !(biospec$specimenID %in% outlier),]

# Identify leading zero issues
issues <- srm$subject_id[ !( srm$subject_id %in% clin$projid ) ]

# Identify the trimmed individuals in the main clinical file:
clin[ clin$projid %in% gsub( "^0", '', issues), ]
```

Created by Jake Gockley jgockley
Weird, it looks like an automatic R issue with read.csv().
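For anyone hitting the same thing, here is a minimal sketch of the behaviour and one possible workaround. It assumes the ID column is named `projid`, as in the code above. The temporary file and its values are invented for illustration and are not taken from syn3191087.

```r
# Write a tiny stand-in for the clinical file (values are invented).
tmp <- tempfile(fileext = ".csv")
writeLines(c("projid,msex", "00123456,1", "12345678,0"), tmp)

# Default behaviour: read.csv() infers the column type, so projid
# becomes an integer and the leading zero is dropped.
clin_default <- read.csv(tmp, header = TRUE)
class(clin_default$projid)   # "integer"
clin_default$projid[1]       # 123456

# Declaring projid as character keeps the ID exactly as stored in the file.
clin_fixed <- read.csv(tmp, header = TRUE,
                       colClasses = c(projid = "character"))
clin_fixed$projid[1]         # "00123456"

# IDs that were already loaded as integers can be re-padded to 8 digits.
repadded <- sprintf("%08d", clin_default$projid)
```

Reading the column as character is probably the safer of the two options, since re-padding assumes every ID really is 8 digits wide.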
Hello @jgockley,

Just to make sure I'm understanding correctly: is it in the projid field of the clinical file (syn3191087) that you are finding an inconsistent number of characters? I looked at the file and I'm seeing a consistent number of characters (8) in each line for projid, including leading zeros. I also didn't find anything odd in the individual IDs. It might be that R is automatically removing these leading zeros when you load the file into your data frame (if I'm understanding correctly).

I checked the number of characters in the projids with this command:

```
cat ROSMAP_clinical.csv | awk -F ',' '{print $1}' | tr -d '"' | awk '{print length($0)}'
```

Best,
Will
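The same width check can be sketched in R by reading the file as plain text, so that no type conversion happens. Like the awk pipeline, this assumes projid is the first column and contains no embedded commas. The stand-in file below uses invented values.

```r
# Stand-in file; the values are invented for illustration.
tmp_raw <- tempfile(fileext = ".csv")
writeLines(c('"projid","msex"', '"00123456",1', '"12345678",0'), tmp_raw)

# Read the file as plain text so nothing is converted to a number.
raw_lines <- readLines(tmp_raw)[-1]              # drop the header row
first_field <- sub(",.*$", "", raw_lines)        # text before the first comma
ids <- gsub('"', "", first_field, fixed = TRUE)  # strip the quotes

# Tabulate ID widths; a single entry named "8" means every ID has 8 characters.
widths <- table(nchar(ids))
```

If `widths` has only the entry `8` while the loaded data frame shows shorter IDs, the zeros are being lost at load time rather than in the file.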
