unc.edu_UCEC_IlluminaGA_RNASeq.geneExp.whitelist_tumor

Created By Kyle Ellrott kellrott
UNC RNA-Seq Workflow - BWA Alignment to Transcriptome Date: 20110421 Authors: Sara Grimm Brian O'Connor Versions: This analysis was carried out using the SeqWare Pipeline project, version 0.7.0. The workflow was "RNASeqAlignmentBWA" version 0.7.4 (UNCID:23324) or 0.7.8 (UNCID:125637) depending on the sample along with "RNASeqQuantification" version 0.7.0 (UNCID:160921). UNC provides all our analysis software through this open source project. Users can download this software to run the identical RNA-Seq analysis described in the steps below. See the project website at http://seqware.sf.net for more information. The UNCIDs provided in file names are identifiers unique to UNC and can be used to provide data/analysis provenance tracking. Annotations: The Generic Annotation File (GAF) that provides all of our annotations for genes, exons, etc can be found at https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/other/GAF/GAF_bundle/outputs/TCGA.Sept2010.09202010.gaf A Note About GRCh37-lite & ChrM: We used the hg19/GRCh37-based "UCSC gene" track provided by UCSC in performing our transcriptome alignment. However, the TCGA has standardized on GRCh37-lite which is described as: "GRCh37-lite is a subset of the full GRCh37 human genome assembly (assembly accession GCA_000001405.1) plus the human mitochondrial genome reference sequence (the "rCRS") from Mitomap.org. This set of sequences excludes all the alternate loci scaffolds of the full GRCh37 assembly, and has the pseudo-autosomal regions (PARs) on chromosome Y masked with Ns. This haploid representation of the genome is provided as a convenience for use in alignment pipelines that cannot handle the multiple placements expected in the PARs and in regions of the genome that are represented by the alternate loci." This description was taken from: ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/special_requests/README.GRCh37-lite The hg19/GRCh37 genome the UCSC gene track is based on differs from GRCh37-lite in chrM: "Since the release of the UCSC hg19 assembly, the Homo sapiens mitochondrion sequence (represented as "chrM" in the Genome Browser) has been replaced in GenBank with the record NC_012920. We have not replaced the original sequence, NC_001807, in the hg19 Genome Browser. We plan to use the Revised Cambridge Reference Sequence (rCRS) in the next human assembly release." This description was taken from: http://genome.ucsc.edu/cgi-bin/hgGateway?org=Human&db=hg19 A future release of the GAF file will map to the TCGA sanctioned version of the chrM sequence and subsequent level 3 files will be prepared with reference to this version. In the mean time please consider this difference when interpreting these results. Conventions: Please note our spljxn.quantification.txt, exon.quantification.txt, and the GAF file above use the convention of chr:smaller_int-larger_int:+ for plus strand features and chr:larger_int-smaller_int:- for negative strand features. Carefully examine future versions of these annotation and quantification files since this convension is subject to change. Analysis Process: Step 1: Construct database of reference transcript sequences, composite gene models, composite exons, and splice junctions. The reference transcript set is based on the hg19 UCSC Gene standard track (December 2009 version), which is a publicly-available sequence set that can be downloaded from the UCSC Genome Browser (http://genome.ucsc.edu/). From this set, only sequences mapped to canonical human chromosomes (chr1-22, X, Y, and M) were retained. For each selected transcript, the nucleotide sequence, association to Entrez/LocusLink genes (if known), and CDS range (if known) were extracted from the UCSC database tables. Additionally, a pairwise alignment of each transcript against the hg19 genome was provided by Mark Diehkans from UCSC. The reference transcript set contains 73,671 human transcript sequences representing 20,532 human genes. 67,434 transcript sequences are given gene assignments according to the UCSC database tables; the remaining 6,237 sequences in the reference transcript set are not currently associated with an Entrez/LocusLink gene. No transcript is associated with more than one Entrez/LocusLink gene. For each gene represented by one or more transcripts in this set, a composite gene model was generated by merging all overlapping exons (as defined by the genomic mapping) from each associated reference transcript. Thus, each composite gene model is essentially the union of all associated reference transcripts. Each gene model is linked to a specific Entrez/LocusLink gene identifier. A library of composite exons was then established by collecting each contiguous genomic segment from each gene model. Note that because the gene model construction process involves merging overlapping exons from different transcripts, the resulting composite exons may or may not be observed in reference transcript sequences. This library contains 239,886 unique composite exons, each of which is defined by its genomic range. A library of all known splice junctions was established by cataloging all junctions observed in the reference transcript sequences. Inclusion of a splice junction in this library indicates that it has been observed in (at least) one reference transcript. However, not all splice junctions in this library will be observed in the composite gene models. This library contains 249,775 unique splice junctions, where each junction is defined by the last position of exon N and the first position of exon N+1. Step 2: Prepare raw sequencing reads for alignment. The ?illumina2srf? tool from the DNA Sequence Read Toolkit (http://sourceforge.net/projects/sequenceread/) is used to convert the raw sequencing data from the vendor-specific format to standard SRF (sequence read format), which is a preferred compressed format for storing sequencing reads. Before data processing begins, the ?srf2fastq? tool from the Staden Package (http://staden.sourceforge.net/) is utilized to convert the data from SRF to the FASTQ format, using the ?-C? parameter to simultaneously filter poor quality reads. In some cases, the insert size of the sequenced fragment is shorter than the read length, resulting in part or all of the adapter (primer) sequence being incorporated into the output read. Since any read containing adapter sequence is highly unlikely to align to a biological reference database, all reads are further processed to trim any adapter segments from the sequences. The criteria for identifying adapter sequence within a read are as follows: (a) the read contains an exact match with the adapter sequence, (b) the sequence match begins at the first base of the adapter, (c) the sequence match continues until either the end of the adapter or the end of the read, and (d) the sequence match is at least 5 bases in length. If all criteria are met, the read is trimmed starting at the first base of the adapter sequence match. Step 3: Align sequencing reads against reference transcript database. A pre-processed lane of sequencing reads is then submitted to the BWA algorithm (http://bio-bwa.sourceforge.net/) for alignment against the reference transcript database using default parameters. The resulting SAM file is converted to BAM format using Picard tools (http://picard.sourceforge.net/). Step 4: Calculate quantification at the transcript level. The reads aligned to the reference transcript sequences are used to determine transcript level quantification. Three quantification values are reported: raw read counts, coverage, and RPKM. Raw read counts is simply the number of reads aligned to a given reference transcript. Raw read counts are normalized by transcript length to give coverage. For a given TranscriptX, coverage is calculated by: total bases aligned to TranscriptX / length of TranscriptX. The ?total bases aligned? will typically be equal to aligned reads × read length. However, in cases where reads have been trimmed due to adapter contamination, this value may be slightly lower. Raw read counts are normalized by both transcript length and overall lane yield to give RPKM. For a given TranscriptX, RPKM is calculated by: (raw read counts × 10^9) / (total reads × length of TranscriptX). For this calculation, ?total reads? is the lane yield after removing poor quality reads. Step 5: Calculate quantification at the gene level. Quantification at the gene level is determined by collapsing the transcript level quantifications onto their associated gene loci. Raw read counts, coverage, and RPKM are reported for each gene. Raw read counts for a given GeneX is the sum of reads aligned to all transcripts associated with that gene. Coverage for a given GeneX is determined by: sum of bases aligned to all transcripts associated with GeneX / length of GeneX. RPKM for a given GeneX is calculated by: (raw read counts × 10^9) / (total reads × length of GeneX). ?Total reads? is the lane yield after removing poor quality reads. In both the coverage and RPKM calculations, the length of GeneX is defined as the median length of all transcripts associated with GeneX. Step 6: Calculate quantification at the exon level. In order to carry quantification for sequence features defined by their genomic locus (i.e. exons and splice junctions), the aligned reads must first be translated from transcript coordinates to genomic coordinates. The pre-established pairwise mapping between each reference transcript and the hg19 genome allows for a straightforward conversion between these two coordinate systems. These converted aligned reads are then used to determine exon level quantification. Raw base counts, coverage, and RPKM are reported for each entry in the library of composite exons. The raw base counts for a given ExonX is the total number of bases aligned to that genomic segment. Raw base counts are used instead of raw read counts because in many cases only a portion of a read will align to a given exon. For a given ExonX, coverage is calculated by: raw base counts / exon length. RPKM for a given ExonX is determined by: ( (raw base counts / median read length) × 10^9) / (total reads × exon length). Except in cases where the raw reads have undergone extensive adapter trimming, the median read length will be equal to the experimental read length for that lane. ?Total reads? is the lane yield after removing poor quality reads. Step 7: Calculate quantification at the splice junction level. Quantification at the splice junction level is also calculated based on the aligned reads converted to genomic coordinates. The only reported value for this level is raw read counts, which is determined by the number of reads that cross a particular junction. Because a splice junction has an effective length of zero, the coverage and RPKM calculations do not apply. Step 8: Generate visualization of coverage. To facilitate visualization of read coverage across genomic segments of interest, the quantification data is converted to the UCSC bigWig format to enable viewing in the UCSC Genome Browser with down to single-base resolution. Step 9: Evaluate QC metrics. Each lane is evaluated according to a variety of both pre- and post-alignment QC measures in order to determine whether the data set in question falls within expected quality thresholds. The following metrics are assessed on a per-lane basis: Total read counts Reads passing filter Percent of reads that are unique Base quality per cycle Nucleotide distribution per cycle Percent of reads requiring adapter trimming Mean effective read length after adapter trimming Percent of reads aligning to human reference transcripts Percent of reads aligning to human rRNA Percent of reads aligning to human miRNA Percent of reads aligning to human mtDNA Percent of reads aligning to the human genome Percent of reads aligning to other genomes (mouse, rat, fruit fly, Arabidopsis, yeast) Percent of reads aligning to viral genomes Average coverage per human gene Average coverage across the length of reference transcripts Column Headers: These are just brief descriptions of the column headers you will find in the various level 3 files. File: *.trimmed.annotated.gene.quantification.txt gene: This is the Entrez/LocusLink gene symbol followed by the Entrez/LocusLink gene ID. raw_counts: The number of reads mapping to this gene. median_length_normalized: This is the total aligned bases to all transcript models associated with this gene divided by the mean transcript length. RPKM: See the DESCRIPTION.txt file in the mage-tab bunlde for information on how this is calculated. File: *.trimmed.annotated.exon.quantification.txt exon: This is the location of the exon in hg19 (GRCh37) based on the UCSC Gene standard track (December 2009 version). raw_counts: The number of reads mapping to this exon. median_length_normalized: This is the total aligned bases to this exon divided by the exon length. RPKM: See the DESCRIPTION.txt file in the mage-tab bunlde for information on how this is calculated. File: *.trimmed.annotated.spljxn.quantification.txt This file does not include normalized counts since splice junctions are a fixed size. junction: This is the location of the splice junction in hg19 (GRCh37) based on the UCSC Gene standard track (December 2009 version). raw_counts: The number of reads mapping to this splice junction. File: *.wig This is a WIG file format that represents coverage, see http://genome.ucsc.edu/FAQ/FAQformat.html#format6 for more information.

freeze: tcga_pancancer_v4
acronym: UCEC
disease: cancer
species: Homo sapiens
fileType: genomicMatrix
platform: IlluminaGA_RNASeq
whitelist: pancan12-whitelist-2012_12_18.whitelist
dataCenter: unc.edu
lastUpdate: 2012-10-12
tissueType: endometrial
dataSubType: geneExp
lastUpdated: 2012-10-12
sampleSetType: tumor