unc.edu_BLCA_IlluminaHiSeq_RNASeqV2.geneExp

Created By Kyle Ellrott kellrott
?######################### V1_BWAtoTranscriptome, V1_RNASeqQuantification: UNC V1 RNA-Seq Workflow - BWA Alignment to Transcriptome Date: 20101108 Authors: Sara Grimm Brian O'Connor Versions: This analysis was carried out using the SeqWare Pipeline project, version 0.7.0. The workflow was "RNASeqAlignmentBWA" version 0.7.x. UNC provides all our analysis software through this open source project. Users can download this software to run the identical RNA-Seq analysis described in the steps below. See the project website at http://seqware.sf.net for more information. The UNCIDs provided in file names are identifiers unique to UNC and can be used to provide data/analysis provenance tracking. Annotations: The Generic Annotation File (GAF) that provides all of our annotations for genes, exons, etc can be found at https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/other/GAF/GAF_bundle/outputs/TCGA.Sept2010.09202010.gaf Conventions: Please note our spljxn.quantification.txt, exon.quantification.txt, and the GAF file above use the convention of chr:smaller_int-larger_int:+ for plus strand features and chr:larger_int-smaller_int:- for negative strand features. Carefully examine future versions of these annotation and quantification files since this convension is subject to change. Analysis Process: Step 1: Construct database of reference transcript sequences, composite gene models, composite exons, and splice junctions. The reference transcript set is based on the hg19 UCSC Gene standard track (December 2009 version), which is a publicly-available sequence set that can be downloaded from the UCSC Genome Browser (http://genome.ucsc.edu/). From this set, only sequences mapped to canonical human chromosomes (chr1-22, X, Y, and M) were retained. For each selected transcript, the nucleotide sequence, association to Entrez/LocusLink genes (if known), and CDS range (if known) were extracted from the UCSC database tables. Additionally, a pairwise alignment of each transcript against the hg19 genome was provided by Mark Diehkans from UCSC. The reference transcript set contains 73,671 human transcript sequences representing 20,532 human genes. 67,434 transcript sequences are given gene assignments according to the UCSC database tables; the remaining 6,237 sequences in the reference transcript set are not currently associated with an Entrez/LocusLink gene. No transcript is associated with more than one Entrez/LocusLink gene. For each gene represented by one or more transcripts in this set, a composite gene model was generated by merging all overlapping exons (as defined by the genomic mapping) from each associated reference transcript. Thus, each composite gene model is essentially the union of all associated reference transcripts. Each gene model is linked to a specific Entrez/LocusLink gene identifier. A library of composite exons was then established by collecting each contiguous genomic segment from each gene model. Note that because the gene model construction process involves merging overlapping exons from different transcripts, the resulting composite exons may or may not be observed in reference transcript sequences. This library contains 239,886 unique composite exons, each of which is defined by its genomic range. A library of all known splice junctions was established by cataloging all junctions observed in the reference transcript sequences. Inclusion of a splice junction in this library indicates that it has been observed in (at least) one reference transcript. However, not all splice junctions in this library will be observed in the composite gene models. This library contains 249,775 unique splice junctions, where each junction is defined by the last position of exon N and the first position of exon N+1. Step 2: Prepare raw sequencing reads for alignment. The Illumina Casava 1.8.1 sofatware is used to process raw images and output fastq files for each sample. CASAVA also performs de-multiplexing of pooled samples when appropriate. In some cases, the insert size of the sequenced fragment is shorter than the read length, resulting in part or all of the adapter (primer) sequence being incorporated into the output read. Reads are further processed to trim any adapter segments using the fastq-mcf software (http://code.google.com/p/ea-utils/). The following arguments were used for fastq-mcf: -l 25 -f -P 33 -m 10 -k 0 -q 0. This instructs the trimmer to require at least 10 bases prior to matching, and remove read pairs where either member of the pair is less 25 bases after trimming. Quality based trimming is disabled. In addition, only adaptors that are found in at least 250 of 100,000 randomly sampled reads will be considered during trimming. Step 3: Align sequencing reads against reference transcript database. A pre-processed lane of sequencing reads is then submitted to the BWA algorithm (http://bio-bwa.sourceforge.net/) for alignment against the reference transcript database using default parameters. The resulting SAM file is converted to BAM format using Picard tools (http://picard.sourceforge.net/). Step 4: Calculate quantification at the transcript level. The reads aligned to the reference transcript sequences are used to determine transcript level quantification. Three quantification values are reported: raw read counts, coverage, and RPKM. Raw read counts is simply the number of reads aligned to a given reference transcript. Raw read counts are normalized by transcript length to give coverage. For a given TranscriptX, coverage is calculated by: total bases aligned to TranscriptX / length of TranscriptX. The ?total bases aligned? will typically be equal to aligned reads × read length. However, in cases where reads have been trimmed due to adapter contamination, this value may be slightly lower. Raw read counts are normalized by both transcript length and overall lane yield to give RPKM. For a given TranscriptX, RPKM is calculated by: (raw read counts × 10^9) / (total reads × length of TranscriptX). For this calculation, ?total reads? is the lane yield after removing poor quality reads. Step 5: Calculate quantification at the gene level. Quantification at the gene level is determined by collapsing the transcript level quantifications onto their associated gene loci. Raw read counts, coverage, and RPKM are reported for each gene. Raw read counts for a given GeneX is the sum of reads aligned to all transcripts associated with that gene. Coverage for a given GeneX is determined by: sum of bases aligned to all transcripts associated with GeneX / length of GeneX. RPKM for a given GeneX is calculated by: (raw read counts × 10^9) / (total reads × length of GeneX). ?Total reads? is the lane yield after removing poor quality reads. In both the coverage and RPKM calculations, the length of GeneX is defined as the median length of all transcripts associated with GeneX. Step 6: Calculate quantification at the exon level. In order to carry quantification for sequence features defined by their genomic locus (i.e. exons and splice junctions), the aligned reads must first be translated from transcript coordinates to genomic coordinates. The pre-established pairwise mapping between each reference transcript and the hg19 genome allows for a straightforward conversion between these two coordinate systems. These converted aligned reads are then used to determine exon level quantification. Raw base counts, coverage, and RPKM are reported for each entry in the library of composite exons. The raw base counts for a given ExonX is the total number of bases aligned to that genomic segment. Raw base counts are used instead of raw read counts because in many cases only a portion of a read will align to a given exon. For a given ExonX, coverage is calculated by: raw base counts / exon length. RPKM for a given ExonX is determined by: ( (raw base counts / median read length) × 10^9) / (total reads × exon length). Except in cases where the raw reads have undergone extensive adapter trimming, the median read length will be equal to the experimental read length for that lane. ?Total reads? is the lane yield after removing poor quality reads. Step 7: Calculate quantification at the splice junction level. Quantification at the splice junction level is also calculated based on the aligned reads converted to genomic coordinates. The only reported value for this level is raw read counts, which is determined by the number of reads that cross a particular junction. Because a splice junction has an effective length of zero, the coverage and RPKM calculations do not apply. Step 8: Generate visualization of coverage. To facilitate visualization of read coverage across genomic segments of interest, the quantification data is converted to the UCSC bigWig format to enable viewing in the UCSC Genome Browser with down to single-base resolution. Step 9: Evaluate QC metrics. Each lane is evaluated according to a variety of both pre- and post-alignment QC measures in order to determine whether the data set in question falls within expected quality thresholds. The following metrics are assessed on a per-lane basis: Total read counts Reads passing filter Percent of reads that are unique Base quality per cycle Nucleotide distribution per cycle Percent of reads requiring adapter trimming Mean effective read length after adapter trimming Percent of reads aligning to human reference transcripts Percent of reads aligning to human rRNA Percent of reads aligning to human miRNA Percent of reads aligning to human mtDNA Percent of reads aligning to the human genome Percent of reads aligning to other genomes (mouse, rat, fruit fly, Arabidopsis, yeast) Percent of reads aligning to viral genomes Average coverage per human gene Average coverage across the length of reference transcripts Column Headers: These are just brief descriptions of the column headers you will find in the various level 3 files. File: *.trimmed.annotated.gene.quantification.txt gene: This is the Entrez/LocusLink gene symbol followed by the Entrez/LocusLink gene ID. raw_counts: The number of reads mapping to this gene. median_length_normalized: This is the total aligned bases to all transcript models associated with this gene divided by the mean transcript length. RPKM: See the DESCRIPTION.txt file in the mage-tab bunlde for information on how this is calculated. File: *.trimmed.annotated.exon.quantification.txt exon: This is the location of the exon in hg19 (GRCh37) based on the UCSC Gene standard track (December 2009 version). raw_counts: The number of reads mapping to this exon. median_length_normalized: This is the total aligned bases to this exon divided by the exon length. RPKM: See the DESCRIPTION.txt file in the mage-tab bunlde for information on how this is calculated. File: *.trimmed.annotated.spljxn.quantification.txt This file does not include normalized counts since splice junctions are a fixed size. junction: This is the location of the splice junction in hg19 (GRCh37) based on the UCSC Gene standard track (December 2009 version). raw_counts: The number of reads mapping to this splice junction. File: *.wig This is a WIG file format that represents coverage, see http://genome.ucsc.edu/FAQ/FAQformat.html#format6 for more information. ######################### V2_MapSpliceRSEM: UNC V2 RNA-Seq Workflow - MapSplice genome alignment and RSEM estimation of GAF 2.1 Date: 05-10-2012 Contacts: Lisle Mose Joel Parker Versions: This analysis was carried out using the SeqWare Pipeline project, version 0.7.0. The workflow was "MapspliceRSEM" version 0.7.x. UNC provides all our analysis software through this open source project. Annotations: The Generic Annotation File (GAF) that provides all of our annotations for genes, exons, etc can be found at https://tcga-data.nci.nih.gov/docs/GAF/GAF.hg19.June2011.bundle/outputs/TCGA.hg19.June2011.gaf Conventions: Please note our junction_quantification.txt, exon_quantification.txt, and the GAF file above use the convention of chr:smaller_int-larger_int:+ for plus strand features and chr:larger_int-smaller_int:- for negative strand features. Carefully examine future versions of these annotation and quantification files since this convension is subject to change. Analysis Process: Step 1: Prepare raw sequencing reads for alignment. The Illumina Casava 1.8.1 sofatware is used to process raw images and output fastq files for each sample. CASAVA also performs de-multiplexing of pooled samples when appropriate. Step 3: Align sequencing reads against reference transcript database. A pre-processed lane of sequencing reads is then submitted to the Mapsplice algorithm (http://www.netlab.uky.edu/p/bioinfo/MapSplice2UserGuide) for alignment against the hg19 reference genome database using default parameters. Step 4: Calculate quantification at the gene and transcript level. The reads aligned to the reference genome sequences are translated to transcriptome coordinates prior to transcript level quantification. This step is performed using the UNC Bioinformatics Utilities (UBU) software (https://github.com/mozack/ubu). RSEM is used to estimate gene and transcript abundances (http://deweylab.biostat.wisc.edu/rsem/rsem-calculate-expression.html). These values are normalized to a fixed upper quaritile value of 1000 for gene and 300 for transcript level estimates and the normalized values are placed in a separate file. Step 5: Calculate quantification at the exon level. Raw base counts, coverage, and RPKM are reported for each entry in the library of composite exons. The raw base counts for a given ExonX is the total number of bases aligned to that genomic segment. Raw base counts are used instead of raw read counts because in many cases only a portion of a read will align to a given exon. For a given ExonX, coverage is calculated by: bases covered / exon length. RPKM for a given ExonX is determined by: ( (raw base counts / read length) × 10^9) / (total reads × exon length). Step 6: Calculate quantification at the splice junction level. The only reported value for this level is raw read counts, which is determined by the number of reads that cross a particular junction. Because a splice junction has an effective length of zero, the coverage and RPKM calculations do not apply. Step 7: Evaluate QC metrics. Each lane is evaluated according to a variety of both pre- and post-alignment QC measures in order to determine whether the data set in question falls within expected quality thresholds. The following metrics are assessed on a per-lane basis: Total read counts Reads passing filter Percent of reads that are unique Base quality per cycle Nucleotide distribution per cycle Percent of reads requiring adapter trimming Mean effective read length after adapter trimming Percent of reads aligning to human reference transcripts Percent of reads aligning to human rRNA Percent of reads aligning to human miRNA Percent of reads aligning to human mtDNA Percent of reads aligning to the human genome Percent of reads aligning to other genomes (mouse, rat, fruit fly, Arabidopsis, yeast) Percent of reads aligning to viral genomes Average coverage per human gene Average coverage across the length of reference transcripts

freeze: tcga_pancancer_v4
acronym: BLCA
disease: cancer
species: Homo sapiens
fileType: genomicMatrix
platform: IlluminaHiSeq_RNASeqV2
dataCenter: unc.edu
lastUpdate: 2012-10-16
tissueType: bladder
dataSubType: geneExp
lastUpdated: 2012-10-16