|Title:||Statistical Assessment of Enrichment in Ranked Lists - Algorithms and Applications in Motif Search
|Abstract:||Modern data analysis often faces the task of extracting characteristic features from sets of elements that have been characterized by some measurement assay or procedure. In molecular biology, for example, an experiment may yield measurements pertaining to genes, and questions are then asked about the properties of the genes for which these measurements were high or low. A central technique for analyzing characteristic properties of sets of elements is statistical enrichment. More specifically, experimental results can often be represented as ranked lists of elements, and we then seek enrichment of other properties of these elements at the top or bottom of the ranked list. In our work, we developed statistical and algorithmic approaches that take as input ranked lists of sequences and return significant patterns, also called motifs, that are enriched at the top of the list. The efficiency of our approach, which is based on suffix trees, allows searches over motif spaces that are not covered by existing tools. These include variable gap motifs (two half sites with a flexible-length gap in between) and long motifs over large alphabets. Some of our methods are available through the DRIMust webserver (http://drimust.technion.ac.il/), which provides de novo motif discovery services and addresses short motifs in DNA, RNA and protein sequences. It is computationally very efficient and allows for timely interaction with the results through a friendly interface and a clear output format. Further extending the applicability of the methods, we developed an approach for assessing the significance of position weight matrix motifs in ranked lists of sequences. A position weight matrix (PWM) is a commonly used representation of motifs in biological sequences, and it is more faithful to the underlying biology than representation by exact words. Currently, to the best of our knowledge, there is no statistical methodology for assessing PWM motifs in ranked lists. 
We developed upper bounds on tail distributions that are applicable to assessing PWM motifs in ranked lists of sequences, improving on existing knowledge in this respect. Our bounds can be calculated in polynomial time. Finally, in a study that applies ranked-list statistics, we studied cooperativity between RNA binding proteins and microRNAs. Our work suggests that low-accessibility targets of microRNA-410 are regulated by an RNA binding protein from the Pumilio family in humans, where the latter possibly rescues microRNA recognition sites from highly structured regions, thereby cooperating with the microRNA.|
|Copyright||The above paper is copyrighted by the Technion, the author(s), or others. Please contact the author(s) for more information.|
Remark: Any link to this technical report should be to this page (http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-info.cgi/2014/PHD/PHD-2014-05), rather than to the URL of the PDF or PS files directly. The latter URLs may change without notice.
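
Illustration: the abstract describes seeking enrichment of a property at the top of a ranked list but does not spell out the statistic. One common way to score such enrichment is a minimum hypergeometric score, sketched below under that assumption. The function names `hypergeom_tail` and `min_hypergeom_score` are hypothetical and are not taken from the thesis or from DRIMust. Each element of the ranked list is marked 1 if it has the property (for example, the sequence contains a candidate motif) and 0 otherwise.

```python
from math import comb


def hypergeom_tail(N, B, n, b):
    """P(X >= b) for X ~ Hypergeometric(N items, B marked, n drawn)."""
    upper = min(n, B)
    favourable = sum(comb(B, i) * comb(N - B, n - i) for i in range(b, upper + 1))
    return favourable / comb(N, n)


def min_hypergeom_score(labels):
    """Scan every cutoff n of a ranked 0/1 list and return (score, cutoff).

    score is the smallest hypergeometric tail probability over all prefixes;
    cutoff is the prefix length at which it is attained (0 if no prefix
    scores below 1.0).
    """
    N = len(labels)
    B = sum(labels)
    best_score, best_cutoff = 1.0, 0
    hits = 0
    for n, label in enumerate(labels, start=1):
        hits += label  # marked elements among the top n
        p = hypergeom_tail(N, B, n, hits)
        if p < best_score:
            best_score, best_cutoff = p, n
    return best_score, best_cutoff


if __name__ == "__main__":
    # Toy ranked list: all marked elements sit at the top.
    print(min_hypergeom_score([1, 1, 1, 0, 0, 0]))
```

Because the minimum is taken over many cutoffs, this score is not itself a p-value; it needs a correction for the multiple cutoffs tested before it can be read as a significance level.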
Computer science department, Technion