bcgsc.ca_OV_IlluminaHiSeq_RNASeq.geneExp.whitelist_tumor

Created By Kyle Ellrott kellrott
======================================================================= BCCA GENOME SCIENCES CENTRE TCGA RNA-SEQ DATA FILE HEADER INFORMATION DCC Level 2 data: The TCGA VCF specification for TCGA data is explained at the URL: https://wiki.nci.nih.gov/display/TCGA/TCGA+Variant+Call+Format+%28VCF%29+1.1+Specification The TCGA VCF specification is based on the VCF version 4.1 from the 1000 Genomes project explained at the URL: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 .snv.vcf A tab-delimited text file containing putative single nucleotide variants identified from RNA-seq data in the VCF format. .indel.vcf A tab-delimited text file containing putative indels identified from RNA-seq data in the VCF format. DCC Level 3 data: .gene.quantification.txt A tab-delimited text file containing the following fields: gene = Gene ID from GAF (version 2.0). The ID follows the nomenclature '|'. If the combination of the HUGO symbol and the Entrez ID is not unique, an additional 'NofM' descriptor is added. An ID with '?' indicates that there is no HUGO gene symbol available. e.g. ?|791120; SAP30L|79685|1of2 raw_counts = Sum of fraction of reads (rounded off to nearest integer - restricted by the RNA-seq validator) that mapped to collapsed transcripts representing a specific gene. Reads from pairs that did not pass Illumina?s Chastity filter or with mapping quality less than 10, i.e. reads that did not map uniquely, were excluded from calculation. median_length_normalized = Average coverage over all exons in the collapsed transcripts i.e. sum of the coverage depth at each base in all exons divided by the sum of the exon lengths RPKM = Reads per kilobase of exon per million. Calculation described in detail below. .exon.quantification.txt A tab-delimited text file containing the following fields: exon = Exon coordinates according to GAF (version 2.0) with the nomenclature, chr:-:. '.' in the indicates that there was no strand information available. e.g. chr10:120810487-120810613:. raw_counts = Sum of fraction of reads (rounded off to nearest integer - restricted by the RNA-seq validator) that mapped to an exon. Reads from pairs that did not pass Illumina?s Chastity filter or with mapping quality less than 10 were excluded from calculation. median_length_normalized = Average coverage over the exon i.e. the sum of the coverage depth at each base in an exon divided by the length of the exon. RPKM = Reads per kilobase of exon per million .spljxn.quantification.txt junction = Junction coordinates according to GAF (version 2.0) with the nomenclature, chr::,chr:: e.g. chr1:10016340:+,chr1:10054672:+ raw_counts = Sum of fraction of reads that mapped to exon-exon junctions. Reads from pairs that did not pass Illumina?s Chastity filter or with mapping quality less than 10 were excluded from calculation. ======================================================================= BCCA GENOME SCIENCES CENTRE TCGA RNA-SEQ ANALYIS ALGORITHM DESCRIPTION Genome+Junctions Alignment, Repositioning, and Merging Protocol Name: bcgsc.ca:alignment:IlluminaHiSeq_RNASeq:01 Link: www.bcgsc.ca/downloads/genomes/Homo_sapiens/hg18/bwa_ind/genome/tcga_ref/ Data Level: 1 Data File: *rnaseq.bam The Illumina paired-end sequence data were analyzed with BWA alignment software to map each read pair onto a genome+junction reference [1]. We aligned our sequence data to Hg18 (NCBI build 36.1). For Hg18 (NCBI build 36.1) alignments, the genomic sequence was restricted to chromosomes 1-22, M, X and Y. The exon-exon junction sequences and their corresponding coordinates were defined based on annotations of any transcripts in UCSC known genes, Ensembl (v54) or the Refseq database (as downloaded from the UCSC genome browser on March 2009). BWA's aln and sampe tools were used to generate small-gapped global alignments on the paired-end reads [2]. BWA (version 0.5.7) with the default parameter settings were used for alignments, except that the option (-s) to disable Smith-Waterman alignment was used because this feature was not designed to handle the insert size distribution that occurs in paired-end RNA-Seq data. In the final stage of the alignment, we set the record's flags. First, we took all reads that failed Illumina's Chastity filter and turned on bit 512 in that record's bitwise flag to indicate the read failed platform/vendor quality checks. Then we ran Picard's MarkDuplicates program (version 1.11) to set bit 1024 in the bitwise flag. After the alignment, the junction-aligned reads that mapped to exon-exon junctions were repositioned as large-gapped alignments in the genome based on the coordinates of the exons that were used to construct the junction sequences. Reads that aligned to junctions with insufficient overhang past the splice junction sites were changed to soft-clipped ungapped genomic alignments. All reads that were originally aligned to a junction sequence were tagged with "ZJ:Z". If the junction alignment was transformed into a gapped genomic alignment, a tag of the form, "ZJ:Z:-", was added where these two integers represented the coordinates of the first and the last base of the uncovered region. If the junction alignment was transformed into an ungapped genomic alignment, a tag of the form "ZJ:Z:G__" was added. If the junction alignment was determined to be of poor quality, it was converted to an "unmapped" status and various other forms of "ZJ:Z" tags were added to the record, depending on the source of the problem. When the target sequence depth was met, bam files generated from individual sequencing lanes were merged using samtools (version 0.1.8) into a single bam file representing the sequencing data of a TCGA sample. Finally, Picard's MarkDuplicates (version 1.11) was run again on the merged bam file. [1] Morin R. et al. Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques. 2008 Jul;45(1):81-94. [2] Li H. and Durbin R. Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics 2009; 25:1754-60. Gene Coverage Analysis Protocol Name: bcgsc.ca:RNA_Sequencing:IlluminaHiSeq_RNASeq:01 Link: www.bcgsc.ca/downloads/genomes/Homo_sapiens/hg18/bwa_ind/genome/tcga_ref/ Data Level: 3 Data File: *.gene.quantification.txt The gene coverage analysis was performed with our internal analysis pipeline version 1.0 using "composite" gene annotations from the hg18 (NCBI v36.1) version of the TCGA GAF v2.0. These composite gene models were created in February 2011 by UNC (with assistance from UCSC) based on the annotations in the "UCSC genes" database. Each composite gene annotation was generated by collapsing all transcripts of that gene into a single model such that exonic bases in a composite gene model were the union of exonic bases from all known transcripts of the gene. Thus, the locations of the exonic boundaries used for the gene coverage analysis were not based on a single canonical transcript for each gene. Consequently, the exonic boundaries in a composite gene model may not correspond to the actual boundaries of the expressed transcripts. For simplicity, throughout this document and in the gene coverage results files, a composite gene model is simply referred to as a gene, and it is associated with the id of the gene whose transcripts contributed to that composite model. To generate the raw read counts, we first counted the number of bases of each read that were inside exonic regions in a gene, and then divided this total base count by the read length. Thus our values for the raw number of reads were not whole numbers (i.e. if the entire 50bp read mapped to an exon, we would add 50 to the total base count, which would ultimately contribute 1 to the raw read count. However, if only 25 bases of the read's alignment fell within an exon's boundaries, the total base count would be incremented by 25, which would ultimately contribute 0.5 to the raw read count). In order to comply with the file format specification enforced by the DCC validator, our raw read counts are rounded to the closest whole number. A gene's raw read count is the sum of raw read counts for exons belonging to the gene. Gene coverage is its raw read count divided by the sum of its exon lengths. RPKM is calculated using the formula: (number of reads mapped to all exons in a gene x 1,000,000,000)/(NORM_TOTAL x sum of the lengths of all exons in the gene ) [Note: NORM_TOTAL = the total number of reads that are mapped to all exons from the composite gene models. (i.e. sum of the fractional read count for all exons)] If a read alignment contained a deletion or a large gap, the read did not contribute coverage inside the region spanned by the deletion/gap. Each of the paired end reads was counted separately. We excluded reads from pairs that failed Illumina's Chastity filter, as well as reads with mapping quality < 10. Exon Coverage Analysis Protocol Name: bcgsc.ca:RNA_Sequencing:IlluminaHiSeq_RNASeq:01 Link: www.bcgsc.ca/downloads/genomes/Homo_sapiens/hg18/bwa_ind/genome/tcga_ref/ Data Level: 3 Data File: *.exon.quantification.txt The exon coverage analysis was performed with our internal analysis pipeline version 1.0 using "composite" gene annotations from the hg18 (NCBI v36.1) version of the TCGA GAF v2.0. These composite gene models were created in February 2011 by UNC (with assistance from UCSC) based on the annotations in the "UCSC genes" database. Similarly to the gene coverage analysis, all transcripts of a given gene were collapsed into a single model such that exonic bases in a composite gene model were the union of exonic bases from all known transcripts of the gene. For simplicity, throughout this document and in the exon coverage results files, the collapsed exons are simply referred to as an exon. To generate the raw read counts, we first counted the number of bases of each read that were inside an exonic region, and then divided this total base count by the read length. Thus our values for the raw number of reads were not whole numbers (i.e. if the entire 50bp read mapped to an exon, we would add 50 to the total base count, which would ultimately contribute 1 to the raw read count. However, if only 25 bases of the read's alignment fell within an exon's boundaries, the total base count would be incremented by 25, which would ultimately contribute 0.5 to the raw read count). In order to comply with the file format specification enforced by the DCC validator, our raw read counts are rounded to the closest whole number. Exon coverage is the raw read count of an exon divided by its length. RPKM is calculated using the formula (number of reads (fractional) mapped to an exon x 1,000,000,000)/(NORM_TOTAL x length of an exon) [Note: NORM_TOTAL = the total number of reads (fractional) that mapped to exons, excluding those in the mitochondrial chromosome] If a read alignment contained a deletion or a large gap, the read did not contribute coverage inside the region spanned by the deletion/gap. Each of the paired end reads was counted separately. We excluded reads from pairs that failed Illumina's Chastity filter, as well as reads with mapping quality < 10. Junction Coverage Analysis Protocol Name: bcgsc.ca:RNA_Sequencing:IlluminaHiSeq_RNASeq:01 Link: www.bcgsc.ca/downloads/genomes/Homo_sapiens/hg18/bwa_ind/genome/tcga_ref/ Data Level: 3 Data File: *.spljxn.quantification.txt The junction coverage analysis counts all the read sequences that span an annotated exon-exon junction. For sequences that were aligned to Hg18 (NCBI 36.1), junction coordinates were defined based on the gene annotation from Ensembl, Refseq and known genes from the UCSC genome browser, which was downloaded on March 3, 2009. The counting script excluded reads from pairs that failed Illumina's Chastity filter. Single Nucleotide Variant (SNV) calling analysis Protocol Name: bcgsc.ca:RNA_Sequencing:IlluminaHiSeq_RNASeq:01 Link: www.bcgsc.ca/downloads/genomes/Homo_sapiens/hg18/bwa_ind/genome/tcga_ref/ Data Level: 2 Data File: *.snv.vcf SNVMix2 v0.12.1-rc1 [1] was used to detect putative single nucleotide variants (SNVs). Parameters "-t Mb" and "-Q 40" were used to filter out reads with a BWA mapping quality <40 and used both the mapping and base qualities to calculate the SNV probability. Reads flagged as duplicates by Picard's MarkDuplicates, and reads flagged as poor-quality (based on Illumina's Chastity filter) were also excluded from SNV calling. The values for mu and pi used by SNVMix2 to construct the model were taken from the MB_mQ40_bQ5.model provided by the author of the software. The resulting SNV list was converted to VCF format using a custom script developed at BCGSC, and the values in the filter column of the VCF file reflect the types of filtering that would be typically applied to call SNVs using SNVMix2, namely: refN : Reference base is N lowQ : The sum of p(ab) (i.e. the probability of a heterozygous SNV) and p(bb) (i.e. the probability of a homozygous SNV) for the alternative base is less than 0.99. (i.e. Less than 10 in a phred-like score that indicate the confidence of a SNV call) onldl : There is at least one read with an indel at the given position. edge : None of the reads supporting the SNV contains the SNV base at a position that is more than 5 bases from either end of the read. ss : All reads that support the SNV are mapped to the same strand. lowC : Low coverage junc : All reads that support the SNV also span an exon-exon junction. [1] Goya R. et al. SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors. Bioinformatics. 2010 March 15; 26(6): 730-736. Indel calling analysis Protocol Name: bcgsc.ca:RNA_Sequencing:IlluminaHiSeq_RNASeq:01 Link: www.bcgsc.ca/downloads/genomes/Homo_sapiens/hg18/bwa_ind/genome/tcga_ref/ Data Level: 2 Data File: *.indel.vcf Samtools [1] mpileup v0.1.17 (r973:277) was used to detect putative indels. Parameters "-A", "-B", "-d 1000000", and "-L 1000000" were used to disable the BAQ computation and to ensure that indels could be called at all positions using all reads. The coefficient for down-grading the mapping quality for reads that contain excessive mismatches was set to 50 ("-C 50") following the recommendation from the author of the software regarding BWA alignments. Variants were called on each chromosome separately using the "-r" parameter, then the results were amalgamated into a single VCF file. The whole genome VCF file was then processed with a custom script developed at BCGSC to conform to the TCGA VCF specification. In addition to adding the required format fields, non-indel events were removed, and the VCF filter field was set to "q20" if the indel quality was less than 20, or the field was set to "PASS" if otherwise. [1] Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]

acronym: OV
disease: cancer
species: Homo sapiens
fileType: genomicMatrix
platform: IlluminaHiSeq_RNASeq
whitelist: pancan12-whitelist-2012_12_18.whitelist
dataCenter: bcgsc.ca
lastUpdate: 2012-07-30
tissueType: cell line control
dataSubType: geneExp
lastUpdated: 2012-07-30
sampleSetType: tumor