bcgsc.ca_BRCA_IlluminaGA_miRNASeq

Created By Kyle Ellrott kellrott
Overview of miRNA analysis at the BCCA-GSC The analysis pipeline takes as input a .sam file containing BWA alignments for a sample, and generates a profile of the content of the sample. This includes metrics such as the expression of various RNAs with particular focus on the expression of miRNA species. Preprocessing Before any alignment can occur, some preprocessing on the sequences is necessary. MiRNAs tend to be shorter than the length of reads generated by sequencers, which means that the read will sequence past biological RNA into the 3' adapter of the sequencing protocol. This non-biological sequence at the end of the read will make it difficult for the aligner to align the read, so it is trimmed off first. The adapter trimming works by looking for as much of the specified adapter sequence in the read as possible, allowing mismatches based on the length of adapter found. Only the first 15 bases of the adapter sequence are used for the string matching, regardless of the length of the adapter sequence supplied. If a read matches 15bp of non-biological adapter, it is safe to assume the sequence truly is adapter sequence. For each read, the algorithm first checks if it is an adapter dimer by checking if the adapter sequence is at the start of the read. Since it has been seen in indexed reads that the first few bases of the adapter can be lost, not only is the full adapter sequence matched, the adapter sequence with the first 1, 2, and 3 bases trimmed off the beginning is also matched. If the read is not an adapter dimer, then the algorithm tries to match the 15 bases of adapter anywhere along the read, starting from the end. When matching the full 15bp of adapter, any match with up to 2 mismatches is accepted. If the full 15bp is not found, decreasing lengths of adapter, down to the first 8 bases, is checked with 1 mismatch allowed. If a match is still not found, 7 down to 1 base of adapter is checked with an exact match required. The algorithm will trim 1 base off the end of a read if it happens to match the first base of the adapter. This is because it is better to get a perfect alignment than an alignment with potentially 1 mismatch. In addition, if only 1 base of adapter was found, the read is likely not a miRNA and would not be a concern anyway. After each read has been processed, a report is generated containing the number of reads at each read length. Any read that is under 15bp after trimming is discarded, the rest is sent to alignment. 15bp is chosen because it is the shortest mature miRNA in miRBase. Annotation Given a .sam file, the first step of the pipeline is to annotate each of the aligned coordinates for each read against reference databases. A 3 base pair overlap is required, and in the event that a coordinate range overlaps features in multiple databases, each feature type is given a priority. The feature types, in the order that they are applied, are in Table 1. Table 1. Annotation priorities to resolve multiple matches. From miRBase. mature strand star strand precursor miRNA stemloop, from 1 to 6 bases outside the mature strand, between the mature and star strands "unannotated", any region other than the mature strand in miRNAs where there is no star strand annotated From UCSC RepeatMasker, other small RNAs. snoRNA tRNA rRNA snRNA scRNA srpRNA other RepeatMasker RNAs From UCSC knownGene. coding exons where the annotated CDS region is 0 length 3' UTR 5' UTR coding exon intron From UCSC RepeatMasker, other types of repeats. LINE SINE LTR Satellite RepeatMasker DNA RepeatMasker Low complexity RepeatMasker Simple Repeat RepeatMasker Other RepeatMasker Unknown Profiling After annotation is done, the results are aggregated to produce a report. For each read, a series of filters are used to check whether its alignment will be used. If the read has more than 3 alignments, it is discarded. Currently only perfect alignments with no mismatches are used. Based on expression profiles of test libraries, reads which fail chastity are kept, while reads which are soft-clipped by BWA are not. After filtering, each alignment annotation of the read is looked at to determine the annotation for the read itself. If a read has more than 1 alignment, and the annotations are different, the priorities from Table 1 are used as long as only 1 alignment is to a miRNA. If there are multiple alignments to different miRNAs (or even different regions of the same miRNA), the read is flagged as cross-mapped and every miRNA annotation is preserved. Using these rules, the reads are summed by annotation and coverage reports are generated. To generate a list of isoforms for all miRNAs, all reads annotated as miRNAs are grouped by annotation and sequence, then summed to produce the expression of each individual sequence. Data files The .mirna.quantification.txt data file gives summed expression for all reads aligning to a miRNA. The .isoform.quantification.txt data file gives expression for each individual sequence isoform observed. Data file format File formats for data files are also available online at https://wiki.nci.nih.gov/display/TCGAAnalysWG/BC+miRNA+Formats. The .mirna.quantification.txt data file describing summed expression for each miRNA is as follows: miRNA name raw read count reads per million miRNA reads cross-mapped to other miRNA forms (Y or N) The .isoform.quantification.txt data file describing every individual sequence isoform observed is as follows: miRNA name alignment coordinates as ::-: (coordinates are 1-based half-closed: [start,end)) raw read count reads per million miRNA reads cross-mapped to other miRNA forms (Y or N) region within miRNA

freeze: tcga_pancancer_v4
acronym: BRCA
disease: cancer
species: Homo sapiens
fileType: genomicMatrix
platform: IlluminaGA_miRNASeq
dataCenter: bcgsc.ca
tissueType: breast
dataSubType: miRNA
lastUpdated: 2012-10-12