Morgane Thomas-Chollier - Ecole Normale Supérieure        M2 ENS - September 2015

Session 1: Motif discovery

The goal of this first exercise is to learn how to predict potential regulatory motifs with no a priori knowledge about the regulating factor(s).
A typical situation is a group of co-expressed genes (e.g. from microarray or RNA-seq data) for which we want to identify a possible common regulatory motif. These approaches are also commonly used with genome-wide detection of binding regions, such as ChIP-chip or ChIP-seq data.

You will use RSA-tools (commonly called "RSAT"). From this suite, you will use two programs:
oligo-analysis: discovers over-represented words (oligonucleotides).
dyad-analysis: discovers over-represented dyads (spaced motifs).
You will also learn to estimate the rate of false positive predictions and use appropriate controls to evaluate your results.
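
To make the notion of an "over-represented word" concrete, here is a minimal Python sketch. It is not the RSAT implementation: it counts overlapping occurrences on a single strand and assumes a naive background in which the four nucleotides are independent and equiprobable, whereas oligo-analysis offers richer counting options and background models. The toy sequences are hypothetical and are not the data set used in this session.

```python
from collections import Counter
from math import comb


def count_kmers(sequences, k):
    """Count every overlapping k-mer (single strand) in a list of DNA sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def binomial_tail(observed, trials, p):
    """P(X >= observed) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, x) * p ** x * (1 - p) ** (trials - x)
               for x in range(observed, trials + 1))


# Hypothetical toy input, only meant to illustrate the computation.
toy_sequences = ["TTGTCGAATATTGTCGAA", "ACGTTGTCGAACGT", "GGTTGTCGAATT"]
k = 6
counts = count_kmers(toy_sequences, k)
total_positions = sum(counts.values())

# Assumption: equiprobable, independent nucleotides (a deliberately naive background).
prior = 0.25 ** k

for word, observed in counts.most_common(3):
    expected = prior * total_positions
    p_value = binomial_tail(observed, total_positions, prior)
    print(word, observed, round(expected, 4), f"{p_value:.2e}")
```

A word is reported as over-represented when its observed count is much higher than the expected count, which is what the small tail probability expresses here.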

Discovering over-represented oligonucleotides

You will discover a potential regulatory motif of the genes regulated by the TF Spo0A, the main regulator of sporulation in the bacterium Bacillus subtilis. The list of target genes was obtained from a ChIP-chip experiment.
You will use the RSAT Teaching server.
You will partly follow this protocol: "Using RSAT oligo-analysis and dyad-analysis tools to discover regulatory signals in nucleic sequences" by Defrance et al., Nature Protocols (2008) (PDF).
All input files are located here (copying and pasting from the EQUIPMENT SETUP section of the PDF does not work well):
http://rsat.ulb.ac.be/rsat/data/published_data/nature_protocols/pattern_discovery/
  1. Read the Introduction. Study case 2 can be skipped, as this course does not cover the topic of phylogenetic footprints. Be sure not to skip the section Other applications of this protocol!
  2. Follow the procedure from step 1 to step 13 with the program oligo-analysis (option A).

    Analyzing the results
    How many sequences were used? Tip: look at the information above the table containing the discovered words.
    If we are looking for over-represented oligomers of size k, what is the maximum order of the background Markov model?
    Look at the top result. How many times was this oligonucleotide found in the input set? How many times was it expected? How was this expected number calculated?
    In the feature map, how do you explain that some of the discovered oligonucleotides overlap?

  3. To finish interpreting the above results, read the section Anticipated results, application 1, option A.
  4. Read Box 1, Box 2 and Box 3
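
The questions in step 2 refer to a background Markov model. As a rough illustration only, the sketch below estimates the probability of a word under a Markov model of a given order, with the model trained on a set of sequences. The function names and the toy training sequence are hypothetical, the `order` argument is assumed to be smaller than the word length, and this is not how RSAT computes or stores its background models.

```python
from collections import Counter

ALPHABET = "ACGT"


def kmer_counts(sequences, k):
    """Count overlapping k-mers in a list of upper-case DNA sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def markov_word_probability(word, sequences, order):
    """Probability of `word` under a Markov model of the given order.

    The model is estimated from `sequences`: the first `order` letters use
    the observed prefix frequency, and each following letter uses the
    transition probability P(letter | previous `order` letters).
    """
    transitions = kmer_counts(sequences, order + 1)
    prob = 1.0
    if order > 0:
        prefixes = kmer_counts(sequences, order)
        prob = prefixes[word[:order]] / sum(prefixes.values())
    for i in range(order, len(word)):
        context = word[i - order:i]
        denom = sum(transitions[context + c] for c in ALPHABET)
        if denom == 0:
            return 0.0  # context never observed in the training sequences
        prob *= transitions[context + word[i]] / denom
    return prob


# Hypothetical training sequence, chosen so the result is easy to check by hand.
training = ["ACGTACGTACGT"]
print(markov_word_probability("ACG", training, order=0))
print(markov_word_probability("ACG", training, order=1))
```

Multiplying such a probability by the number of possible positions in the input set gives an expected number of occurrences, which can then be compared with the observed count.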

Discovering over-represented dyads (spaced motifs)

FNR represses genes involved in aerobic respiration and activates genes required for anaerobic respiration. You will discover a potential spaced motif in the promoters of 98 target genes of the factor FNR, in Escherichia coli K12.
  1. Follow (again!) the procedure from step 1 to step 13, this time with the program dyad-analysis (option B).
  2. Note that you are now working with Escherichia coli K12.

    Analyzing the results
    Did you find any significant spaced motif(s)?
    How were the dyads assembled to obtain the final motif?

  3. To finish interpreting the above results, read the section Anticipated results, application 1, option B.
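
To illustrate what a dyad is, the following Python sketch counts pairs of trinucleotides separated by a spacer of fixed width. It is a simplified illustration, not the dyad-analysis implementation: it works on a single strand, and the monad length (3), the spacing range (0 to 20), the function name and the toy sequences are all assumptions made for the example.

```python
from collections import Counter


def count_dyads(sequences, monad_len=3, min_spacing=0, max_spacing=20):
    """Count (left monad, spacing, right monad) triples on a single strand."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for spacing in range(min_spacing, max_spacing + 1):
            span = 2 * monad_len + spacing  # total width of the dyad
            for i in range(len(seq) - span + 1):
                left = seq[i:i + monad_len]
                right = seq[i + monad_len + spacing:i + span]
                counts[(left, spacing, right)] += 1
    return counts


# Hypothetical toy sequences sharing CAC and TGC separated by 5 arbitrary bases.
toy_sequences = ["GGCACGTTAATGCGG", "TTCACAAGGATGCTT"]
dyads = count_dyads(toy_sequences)

for (left, spacing, right), n in dyads.most_common(3):
    # Written as left monad, spacer width, right monad, e.g. CACn{5}TGC
    print(f"{left}n{{{spacing}}}{right}", n)
```

Because the spacer content is ignored, the two toy sequences contribute to the same dyad even though the bases between the two trinucleotides differ.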

Retrieving sequences with RSAT

RSAT provides many utility tools, including retrieve-seq, retrieve-ensembl-seq and fetch-sequences.
retrieve-seq: retrieves sequences relative to a reference position (e.g. sequences upstream of the TSS).
retrieve-ensembl-seq: same as retrieve-seq, but the sequences come from the Ensembl database.
fetch-sequences: retrieves sequences from a set of genomic coordinates. The sequences come from the UCSC database.
  1. Use the retrieve-seq program to retrieve upstream sequences.
  2. Go to the “retrieve-seq” form and follow the tutorial (use the "TUTORIAL" link at the bottom of the form). The last "Exercices" section is optional.
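
The idea of retrieving a sequence "relative to a reference" can be sketched in a few lines of Python. This is only an illustration of the coordinate logic, not the retrieve-seq program: the function names, the 0-based TSS index, the default limits (-400 to -1) and the toy genome are assumptions made for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def upstream_sequence(genome, tss, strand, start=-400, end=-1):
    """Extract the region from `start` to `end`, relative to the TSS.

    `tss` is the 0-based index of the TSS in `genome`; position -1 is the
    base immediately upstream. For a gene on the "-" strand, upstream lies
    at higher genomic coordinates and the result is reverse-complemented.
    """
    if strand == "+":
        lo = max(0, tss + start)
        hi = max(0, tss + end + 1)
        return genome[lo:hi]
    lo = tss - end
    hi = min(len(genome), tss - start + 1)
    return reverse_complement(genome[lo:hi])


# Hypothetical 16-bp "genome" to check both orientations by hand.
genome = "AAAACCCCGGGGTTTT"
print(upstream_sequence(genome, tss=8, strand="+", start=-4, end=-1))
print(upstream_sequence(genome, tss=7, strand="-", start=-4, end=-1))
```

Handling the strand explicitly matters: for genes on the reverse strand, the upstream region must be taken on the other side of the reference position and reverse-complemented so that all promoters are read in the same orientation.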

Morgane Thomas-Chollier - Ecole Normale Supérieure        mthomas[at]biologie.ens.fr