miSNP Analysis

Created By Anonymous anonymous
Hamilton et al. miSNP and AGO-CLIP atlas: brief description of data

AGO-CLIP SRR files corresponding to all human AGO-CLIP runs were downloaded from the NIH Sequence Read Archive. Files were individually pre-processed using the FASTX-Toolkit and cutadapt to remove adaptor sequences and control for sequence quality. Reads were length-filtered at 15 nucleotides for most samples. The quality of each dataset was assessed individually using FastQC. Individual sequencing runs were tiered and grouped first by publishing group, then by cell line, then by treatment, and finally by total reads. In this manner fourteen independent AGO-CLIP datasets were defined: eleven PAR-CLIP and three HITS-CLIP.

Reads were mapped to hg19 using Bowtie with the parameters established in Corcoran et al., 2011. The Bowtie output was then run through the PARalyzer algorithm to generate clusters and seeds. A lenient seed mapping strategy was used that included all miRBase seed families. The purpose of this lenient mapping was to anchor redundant read clusters to the genome so that seed site recurrence could be determined across all datasets. Using this strategy, 99% of the 123752 clusters mapping to the UCSC known gene transcriptome were anchored to a seed, though likely at the expense of false positives among less expressed microRNAs. From these cluster sequences 306733 seed sequences were inferred. Notably, multiple seeds may be inferred from a single cluster sequence, and these seeds often overlap a single site. The actual binding partner may be one of these seeds or all of them, or the site may represent a form of non-canonical binding that is not considered in our current motif-based analysis but will be incorporated at a later date.

AGO-HITS-CLIP runs were also run through PARalyzer, and group datasets were isolated. These groups were superimposed over the microRNA seeds identified by the PAR-CLIP reads.
In this way the AGO-HITS-CLIP datasets were allowed to support a seed identified by the PAR-CLIP runs, but were not allowed to perform de novo seed nomination. Following seed nomination, each seed, which is anchored to a specific point on the genome, was scored for recurrence across the 14 datasets. A permutation analysis was then performed to determine the likelihood of a given seed being recurrently identified by chance, and an FDR was assigned. We determined that seed recurrence in 3 or more datasets corresponded to a Q-value < 0.05. A file index of the AGO-CLIP runs used is included.

miSNP

The miSNP code seeks to integrate PAR-CLIP seed or cluster calls with the integrative genomic data generated by the TCGA. Specifically, miSNP is formulated to intersect genomic features with putative seed sites and can then retrieve mRNA expression data related to each feature. Our current formulation of miSNP focuses on determining seed-SNV interactions; future experiments will focus on mRNA editing events.

Current iterations of our algorithm are limited by the TCGA exome sequencing data itself. While 1000 complete 3'UTRs are included in the exome sequencing pull-downs, most of these important sites are filtered out of downstream analyses. The only 3'UTR data available and annotated in the Synapse whitelist is in KIRC. We have additionally intersected 36 COAD whole genome sequencing (WGS) samples. As the vast majority of AGO-CLIP clusters fall in 3'UTRs, this limitation has forced us to look at the less populous coding-region seeds and at seeds identified in noncoding RNAs. The whitelist sites and the KIRC 3'UTR data have generated 7876 seed-SNV binding site interactions, primarily in non-canonical seeds. Notably, 990 seed-SNV interactions were generated from just 36 COAD WGS samples, with 78.5% of these sites mapping to the 3'UTR, highlighting the amount of data that will be gained when larger 3'UTR datasets exist.

Synapse Data

Provided are two forms of miSNP data output.
The first, annotated as _union files, represents the union of all seed-SNV intersections and reports tumor type, barcode, and mutated base. The second, annotated as _summary, contains mRNA expression data for each gene corresponding to the seed intersections. Because the number of CDS seeds is limited, the current expression data group all samples carrying a seed mutation for comparison against the larger dataset.

Seeds are identified by type and by sequence. The PARalyzer seed calling nomenclature is used, which differs slightly from the official nomenclature. Namely, PARalyzer identifies perfect pairing with the microRNA seed for x bases and labels this an x-mer seed. If an upstream A is present in the complementary mRNA sequence, a 1A is added to the end. This transforms TargetScan's traditional 7mer1A into a 6mer1A in the PARalyzer nomenclature, and so on.

For each gene annotation a TCGA ID and a UCSC gene ID are given. These almost always agree, but the TCGA SNV data account for only a single strand, whereas the AGO-CLIP data frequently detect binding sites on pseudogenes and ncRNAs transcribed from the opposite strand, which yields a different annotation. It is also possible that in some cases the gene coordinates of UCSC known gene and those used by the TCGA do not agree.
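
The naming rule described above can be illustrated with a short Python sketch. This is not code from miSNP or PARalyzer; the function names and the lookup table are hypothetical, and the TargetScan site types are written in TargetScan's own spelling (7mer-A1 for what the text calls 7mer1A). It only restates the stated rule: count the perfectly paired seed bases, then append 1A when the adenosine is present.

```python
# Illustrative only: helper names are hypothetical, not part of miSNP or PARalyzer.

def paralyzer_label(paired_bases: int, has_a: bool) -> str:
    """Build a label from the number of perfectly paired seed bases,
    appending '1A' when the adenosine described in the text is present."""
    if paired_bases < 1:
        raise ValueError("paired_bases must be a positive integer")
    return f"{paired_bases}mer" + ("1A" if has_a else "")

# TargetScan site types expressed as (paired seed bases, adenosine present).
# The 7mer-A1 site pairs at six seed positions plus the A;
# the 8mer site pairs at seven seed positions plus the A.
TARGETSCAN_SITES = {
    "6mer": (6, False),
    "7mer-m8": (7, False),
    "7mer-A1": (6, True),
    "8mer": (7, True),
}

def targetscan_to_paralyzer(site_type: str) -> str:
    """Translate a TargetScan site type into the labeling used in these files."""
    paired_bases, has_a = TARGETSCAN_SITES[site_type]
    return paralyzer_label(paired_bases, has_a)

if __name__ == "__main__":
    for name in TARGETSCAN_SITES:
        print(name, "->", targetscan_to_paralyzer(name))
```

Under this reading, the adenosine is never counted as a paired base, which is why each TargetScan name that includes the A drops by one in the x-mer count.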

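The seed-SNV intersection that produces the _union records can be sketched roughly as follows. This is a minimal illustration and not the miSNP implementation: the record fields, the function name, the coordinates, the gene names, and the sample barcodes are all made up for the example, and coordinates are assumed to be 1-based and inclusive. Strand is carried along but not used in the overlap test, which mirrors the single-strand caveat noted for the TCGA SNV data.

```python
from collections import defaultdict
from typing import NamedTuple

# Hypothetical record layouts; the real miSNP file columns may differ.
class Seed(NamedTuple):
    chrom: str
    start: int      # 1-based, inclusive (assumption)
    end: int        # 1-based, inclusive (assumption)
    strand: str
    seed_type: str
    gene: str

class SNV(NamedTuple):
    chrom: str
    pos: int        # 1-based (assumption)
    ref: str
    alt: str
    tumor_type: str
    barcode: str

def intersect_seeds_with_snvs(seeds, snvs):
    """Return (seed, snv) pairs where the SNV position falls inside the seed site."""
    seeds_by_chrom = defaultdict(list)
    for seed in seeds:
        seeds_by_chrom[seed.chrom].append(seed)

    hits = []
    for snv in snvs:
        # Only seeds on the same chromosome can overlap; strand is ignored here.
        for seed in seeds_by_chrom.get(snv.chrom, []):
            if seed.start <= snv.pos <= seed.end:
                hits.append((seed, snv))
    return hits

# Made-up example data, for illustration only.
seeds = [
    Seed("chr1", 1000, 1006, "+", "6mer1A", "GENE_A"),
    Seed("chr2", 500, 506, "-", "7mer", "GENE_B"),
]
snvs = [
    SNV("chr1", 1003, "C", "T", "KIRC", "SAMPLE-0001"),
    SNV("chr1", 2000, "G", "A", "KIRC", "SAMPLE-0002"),
    SNV("chr2", 506, "A", "G", "COAD", "SAMPLE-0003"),
]
hits = intersect_seeds_with_snvs(seeds, snvs)

if __name__ == "__main__":
    for seed, snv in hits:
        print(snv.tumor_type, snv.barcode, seed.gene, seed.seed_type,
              f"{snv.ref}>{snv.alt}")
```

The nested loop is adequate for a small illustration; at the scale of the 306733 seeds described above, a sorted interval index or a dedicated tool would be a more realistic choice.
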
syn1720733
syn1703136
syn1899820
syn1721863
syn1720734
syn1856597
syn1703154
syn1703143