bcgsc.ca_LAML_IlluminaGA_RNASeq.geneExp.whitelist_tumor

Created By Kyle Ellrott kellrott
======================================================================= BCCA GENOME SCIENCES CENTRE TCGA RNA-SEQ DATA FILE HEADER INFORMATION .gene.quantification.txt A tab-delimited text file containing the following fields: gene = Gene ID from GAF (version 2.0). The ID follows the nomenclature '|'. If the combination of the HUGO symbol and the Entrez ID is not unique, an additional 'NofM' descriptor is added. An ID with '?' indicates that there is no HUGO gene symbol available. e.g. ?|791120; SAP30L|79685|1of2 raw_counts = Sum of fraction of reads (rounded off to nearest integer - restricted by the RNA-seq validator) that mapped to collapsed transcripts representing a specific gene. Reads from pairs that did not pass Illumina?s Chastity filter or with mapping quality less than 10, i.e. reads that did not map uniquely, were excluded from calculation. median_length_normalized = Average coverage over all exons in the collapsed transcripts i.e. sum of the coverage depth at each base in all exons divided by the sum of the exon lengths RPKM = Reads per kilobase of exon per million. Calculation described in detail below. .exon.quantification.txt A tab-delimited text file containing the following fields: exon = Exon coordinates according to GAF (version 2.0) with the nomenclature, chr:-:. '.' in the indicates that there was no strand information available. e.g. chr10:120810487-120810613:. raw_counts = Sum of fraction of reads (rounded off to nearest integer - restricted by the RNA-seq validator) that mapped to an exon. Reads from pairs that did not pass Illumina?s Chastity filter or with mapping quality less than 10 were excluded from calculation. median_length_normalized = Average coverage over the exon i.e. the sum of the coverage depth at each base in an exon divided by the length of the exon. RPKM = Reads per kilobase of exon per million .spljxn.quantification.txt junction = Junction coordinates according to GAF (version 2.0) with the nomenclature, chr::,chr:: e.g. chr1:10016340:+,chr1:10054672:+ raw_counts = Sum of fraction of reads that mapped to exon-exon junctions. Reads from pairs that did not pass Illumina?s Chastity filter or with mapping quality less than 10 were excluded from calculation. ======================================================================= BCCA GENOME SCIENCES CENTRE TCGA RNA-SEQ ANALYIS ALGORITHM DESCRIPTION Genome+Junctions Alignment, Repositioning, and Merging Protocol Name: bcgsc.ca:alignment:IlluminaGA_RNASeq:01 Link: www.bcgsc.ca/downloads/genomes/Homo_sapiens/hg18/bwa_ind/genome/tcga_ref/ Data Level: 1 Data File: *rnaseq.bam The Illumina paired-end sequence data were analyzed with BWA alignment software to map each read pair onto a genome+junction reference [1]. We aligned our sequence data to Hg18 (NCBI build 36.1). For Hg18 (NCBI build 36.1) alignments, the genomic sequence was restricted to chromosomes 1-22, M, X and Y. The exon-exon junction sequences and their corresponding coordinates were defined based on annotations of any transcripts in UCSC known genes, Ensembl (v54) or the Refseq database (as downloaded from the UCSC genome browser on March 2009). BWA's aln and sampe tools were used to generate small-gapped global alignments on the paired-end reads [2]. BWA (version 0.5.7) with the default parameter settings were used for alignments, except that the option (-s) to disable Smith-Waterman alignment was used because this feature was not designed to handle the insert size distribution that occurs in paired-end RNA-Seq data. In the final stage of the alignment, we set the record's flags. First, we took all reads that failed Illumina's Chastity filter and turned on bit 512 in that record's bitwise flag to indicate the read failed platform/vendor quality checks. Then we ran Picard's MarkDuplicates program (version 1.11) to set bit 1024 in the bitwise flag. After the alignment, the junction-aligned reads that mapped to exon-exon junctions were repositioned as large-gapped alignments in the genome based on the coordinates of the exons that were used to construct the junction sequences. Reads that aligned to junctions with insufficient overhang past the splice junction sites were changed to soft-clipped ungapped genomic alignments. All reads that were originally aligned to a junction sequence were tagged with "ZJ:Z". If the junction alignment was transformed into a gapped genomic alignment, a tag of the form, "ZJ:Z:-", was added where these two integers represented the coordinates of the first and the last base of the uncovered region. If the junction alignment was transformed into an ungapped genomic alignment, a tag of the form "ZJ:Z:G__" was added. If the junction alignment was determined to be of poor quality, it was converted to an "unmapped" status and various other forms of "ZJ:Z" tags were added to the record, depending on the source of the problem. When the target sequence depth was met, bam files generated from individual sequencing lanes were merged using samtools (version 0.1.8) into a single bam file representing the sequencing data of a TCGA sample. Finally, Picard's MarkDuplicates (version 1.11) was run again on the merged bam file. [1] Morin R. et al. Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques. 2008 Jul;45(1):81-94. [2] Li H. and Durbin R. Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics 2009; 25:1754-60. Gene Coverage Analysis Protocol Name: bcgsc.ca:RNA_Sequencing:IlluminaGA_RNASeq:01 Link: www.bcgsc.ca/downloads/genomes/Homo_sapiens/hg18/bwa_ind/genome/tcga_ref/ Data Level: 3 Data File: *.gene.quantification.txt The gene coverage analysis was performed with our internal analysis pipeline version 1.0 using "composite" gene annotations from the hg18 (NCBI v36.1) version of the TCGA GAF v2.0. These composite gene models were created in February 2011 by UNC (with assistance from UCSC) based on the annotations in the "UCSC genes" database. Each composite gene annotation was generated by collapsing all transcripts of that gene into a single model such that exonic bases in a composite gene model were the union of exonic bases from all known transcripts of the gene. Thus, the locations of the exonic boundaries used for the gene coverage analysis were not based on a single canonical transcript for each gene. Consequently, the exonic boundaries in a composite gene model may not correspond to the actual boundaries of the expressed transcripts. For simplicity, throughout this document and in the gene coverage results files, a composite gene model is simply referred to as a gene, and it is associated with the id of the gene whose transcripts contributed to that composite model. To generate the raw read counts, we first counted the number of bases of each read that were inside exonic regions in a gene, and then divided this total base count by the read length. Thus our values for the raw number of reads were not whole numbers (i.e. if the entire 50bp read mapped to an exon, we would add 50 to the total base count, which would ultimately contribute 1 to the raw read count. However, if only 25 bases of the read's alignment fell within an exon's boundaries, the total base count would be incremented by 25, which would ultimately contribute 0.5 to the raw read count). In order to comply with the file format specification enforced by the DCC validator, our raw read counts are rounded to the closest whole number. A gene's raw read count is the sum of raw read counts for exons belonging to the gene. Gene coverage is its raw read count divided by the sum of its exon lengths. RPKM is calculated using the formula: (number of reads mapped to all exons in a gene x 1,000,000,000)/(NORM_TOTAL x sum of the lengths of all exons in the gene ) [Note: NORM_TOTAL = the total number of reads that are mapped to all exons from the composite gene models. (i.e. sum of the fractional read count for all exons)] If a read alignment contained a deletion or a large gap, the read did not contribute coverage inside the region spanned by the deletion/gap. Each of the paired end reads was counted separately. We excluded reads from pairs that failed Illumina's Chastity filter, as well as reads with mapping quality < 10. Exon Coverage Analysis Protocol Name: bcgsc.ca:RNA_Sequencing:IlluminaGA_RNASeq:01 Link: www.bcgsc.ca/downloads/genomes/Homo_sapiens/hg18/bwa_ind/genome/tcga_ref/ Data Level: 3 Data File: *.exon.quantification.txt The exon coverage analysis was performed with our internal analysis pipeline version 1.0 using "composite" gene annotations from the hg18 (NCBI v36.1) version of the TCGA GAF v2.0. These composite gene models were created in February 2011 by UNC (with assistance from UCSC) based on the annotations in the "UCSC genes" database. Similarly to the gene coverage analysis, all transcripts of a given gene were collapsed into a single model such that exonic bases in a composite gene model were the union of exonic bases from all known transcripts of the gene. For simplicity, throughout this document and in the exon coverage results files, the collapsed exons are simply referred to as an exon. To generate the raw read counts, we first counted the number of bases of each read that were inside an exonic region, and then divided this total base count by the read length. Thus our values for the raw number of reads were not whole numbers (i.e. if the entire 50bp read mapped to an exon, we would add 50 to the total base count, which would ultimately contribute 1 to the raw read count. However, if only 25 bases of the read's alignment fell within an exon's boundaries, the total base count would be incremented by 25, which would ultimately contribute 0.5 to the raw read count). In order to comply with the file format specification enforced by the DCC validator, our raw read counts are rounded to the closest whole number. Exon coverage is the raw read count of an exon divided by its length. RPKM is calculated using the formula (number of reads (fractional) mapped to an exon x 1,000,000,000)/(NORM_TOTAL x length of an exon) [Note: NORM_TOTAL = the total number of reads (fractional) that mapped to exons, excluding those in the mitochondrial chromosome] If a read alignment contained a deletion or a large gap, the read did not contribute coverage inside the region spanned by the deletion/gap. Each of the paired end reads was counted separately. We excluded reads from pairs that failed Illumina's Chastity filter, as well as reads with mapping quality < 10. Junction Coverage Analysis Protocol Name: bcgsc.ca:RNA_Sequencing:IlluminaGA_RNASeq:01 Link: www.bcgsc.ca/downloads/genomes/Homo_sapiens/hg18/bwa_ind/genome/tcga_ref/ Data Level: 3 Data File: *.spljxn.quantification.txt The junction coverage analysis counts all the read sequences that span an annotated exon-exon junction. For sequences that were aligned to Hg18 (NCBI 36.1), junction coordinates were defined based on the gene annotation from Ensembl, Refseq and known genes from the UCSC genome browser, which was downloaded on March 3, 2009. The counting script excluded reads from pairs that failed Illumina's Chastity filter.

freeze: tcga_pancancer_v4
acronym: LAML
disease: cancer
species: Homo sapiens
fileType: genomicMatrix
platform: IlluminaGA_RNASeq
whitelist: pancan12-whitelist-2012_12_18.whitelist
dataCenter: bcgsc.ca
dataSubType: geneExp
sampleSetType: tumor