Hi,
I could be wrong, but the RosMap clinical data (syn3191087) appears to use 8-digit identifiers for individuals. However, some individuals have an ID that starts with a zero, and in the clinical data file this leading zero is missing (it seems to have been trimmed), which prevents a programmatic match when integrating with the biospecimen or assay metadata. Please see the code below, which is relevant for the SRM proteomics:
```
# SRM Proteomics - SNCA ~ rs2245801-C
##install.packages("synapser", repos = c("http://ran.synapse.org", "http://cran.fhcrc.org"))
# Load SRM experimental data
srm <- read.table( synapser::synGet( 'syn21137026' )$path,
header = T, sep = '\t'
)
srm <- srm[,c( 'Replicate.Name', 'plate', 'plate_row', 'plate_col', 'plate_well', 'subject.id', 'isControl', 'sample.type', 'SNCA') ]
srm_dat <- read.table( synapser::synGet( 'syn10468858' )$path,
header = T, sep = '\t', row.names = 1
)
# RosMap SRM Assay Data
assay <- read.csv( synapser::synGet( 'syn23569441' )$path,
header = T#, row.names = 1
)
# RosMap Clinical Data
clin <- read.csv( synapser::synGet( 'syn3191087' )$path,
header = T#, row.names = 1
)
# RosMap Biospecimen Data
biospec <- read.csv( synapser::synGet( 'syn21323366' )$path,
header = T#, row.names = 1
)
# Select gene of interest
srm$Peptide.Sequence <- srm_dat[ 'BIN1_3', ]$Peptide.Sequence
srm$Peptide.Note <- srm_dat[ 'BIN1_3', ]$Peptide.Note
srm$Peptide.Modified.Sequence <- srm_dat[ 'BIN1_3', ]$Peptide.Modified.Sequence
srm$sd.tech <- srm_dat[ 'BIN1_3', ]$sd.tech
srm$sd.full <- srm_dat[ 'BIN1_3', ]$sd.full
srm$med_area_light <- srm_dat[ 'BIN1_3', ]$med_area_light
srm$specie <- srm_dat[ 'BIN1_3', ]$specie
# Reformat column names (replace '.' with '_')
colnames(srm) <- gsub( '[.]', '_', colnames(srm) )
# Select Assay
biospec <- biospec[ biospec$assay %in% 'label free mass spectrometry', ]
outlier <- biospec[ !is.na( biospec$exclude ), ]$specimenID
biospec <- biospec[ !( biospec$specimenID %in% 'control' ), ]
# Remove Outlier and controls
srm <- srm[ !(srm$subject_id %in% 'control'), ]
srm <- srm[ !(srm$subject_id %in% outlier),]
biospec <- biospec[ !(biospec$specimenID %in% outlier),]
# Identify leading zero issues
issues <- srm$subject_id[ !( srm$subject_id %in% clin$projid ) ]
# Identify the trimmed individuals in the main clinical file
# ("^0+" strips every leading zero, not only the first one)
clin[ clin$projid %in% sub( "^0+", '', issues), ]
```
Created by Jake Gockley (jgockley)
Weird, looks like an automatic R issue with read.csv().
Hello @jgockley,
Just to make sure I'm understanding correctly: is it the projid field in the clinical file (syn3191087) where you are finding an inconsistent number of characters? I looked at the file and I'm seeing a consistent number of characters (8) on each line for projid, including leading zeros. I also didn't find anything odd in the individual IDs. If I'm understanding correctly, R may be removing these leading zeros automatically when it loads the file into your data frame. I checked the number of characters in the projids with this command:
```
cat ROSMAP_clinical.csv | awk -F ',' '{print $1}' | tr -d '"' | awk '{print length($0)}'
```
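If R is the cause, one possible workaround is to tell `read.csv()` to keep projid as text. The following is only a minimal sketch: the two-row CSV, its `msex` column, and the ID values are made-up stand-ins for the real clinical file, and it assumes the IDs are meant to be 8 characters wide.

```r
# Hypothetical stand-in for the clinical file (values are invented)
tmp <- tempfile(fileext = ".csv")
writeLines(c('"projid","msex"', '"00123456",1', '"10234567",0'), tmp)

# Default behaviour: projid is converted to integer, so leading zeros are lost
clin_default <- read.csv(tmp, header = TRUE)
class(clin_default$projid)

# Option 1: read projid as character so the zeros are preserved
clin_chr <- read.csv(tmp, header = TRUE,
                     colClasses = c(projid = "character"))

# Option 2: re-pad an already loaded integer column to 8 digits
# (assumes every ID should be exactly 8 characters wide)
clin_padded <- clin_default
clin_padded$projid <- sprintf("%08d", clin_padded$projid)
```

In the original script this would amount to adding `colClasses = c(projid = "character")` to the `read.csv()` call for syn3191087, after which the `%in%` comparison against the SRM subject IDs should compare like with like.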
Best,
Will