Morgane Thomas-Chollier - Ecole Normale Supérieure M2 ENS - September 2015

Session 1: Motif discovery

The goal of this first exercise is to learn how to predict potential regulatory motifs with no a priori knowledge about the regulating factor(s).
A typical situation is a group of co-expressed genes (e.g. from microarray or RNA-seq data) for which we want to identify a possible common regulatory motif. These approaches are also commonly used with genome-wide detection of binding regions, such as ChIP-chip or ChIP-seq data.

You will use the RSA-tools (commonly called "RSAT"). From the RSA-tools suite, you will use the programs:
oligo-analysis to discover over-represented words
dyad-analysis to discover over-represented dyads (=spaced motifs).
You will also learn to estimate the rate of false positive predictions and use appropriate controls to evaluate your results.
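
To make the idea of an "over-represented word" concrete, here is a minimal sketch in Python. It is not the RSAT implementation: the sequences are invented, the function names are hypothetical, only one strand is scanned, and the expected count assumes independent nucleotides (a Bernoulli background), which is simpler than the background models offered by oligo-analysis.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every overlapping k-mer in a list of DNA sequences (one strand only)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def expected_count_bernoulli(word, sequences):
    """Expected occurrences of `word` if nucleotides were drawn independently,
    with the residue frequencies observed in `sequences` (toy background model)."""
    residues = Counter("".join(sequences).upper())
    total = sum(residues.values())
    prob = 1.0
    for letter in word.upper():
        prob *= residues[letter] / total
    # Number of positions where a word of this length can start.
    positions = sum(max(len(s) - len(word) + 1, 0) for s in sequences)
    return prob * positions

# Invented toy "promoters" sharing the hexanucleotide CACGTG.
toy_promoters = ["AACACGTGTT", "GCACGTGAAT", "TTTCACGTGA"]
observed = count_kmers(toy_promoters, 6)
top_word, top_count = observed.most_common(1)[0]
expected = expected_count_bernoulli(top_word, toy_promoters)
print(top_word, top_count, round(expected, 4))
```

A word is a candidate motif when its observed count is much higher than its expected count. Running the same counting on a control set (for example randomly selected sequences of the same size) gives an idea of how often such an excess appears by chance.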

Discovering over-represented oligonucleotides

You will discover a potential regulatory motif of the genes regulated by the TF Spo0A, the main regulator of sporulation in the bacterium Bacillus subtilis. The list of target genes was obtained from a ChIP-chip experiment.
You will use the RSAT Teaching server.
You will follow part of this protocol: "Using RSAT oligo-analysis and dyad-analysis tools to discover regulatory signals in nucleic sequences" by Defrance et al., Nature Protocols (2008) (PDF).
All input files are located here (copying and pasting from the EQUIPMENT setup section of the PDF does not work well):
http://rsat.ulb.ac.be/rsat/data/published_data/nature_protocols/pattern_discovery/

Read the Introduction. Study case 2 can be skipped, as this course does not cover the topic of phylogenetic footprints. Be sure not to skip the section Other applications of this protocol!
Follow the procedure from step 1 to step 13 with the program oligo-analysis (option A).
Analyzing the results
How many sequences were used? Tip: look at the information above the table containing the discovered words.
If we are looking for over-represented oligomers of size k, what is the maximum order of the background Markov model?
Look at the top result. How many times was this oligonucleotide found in the input set? How many times was it expected? How was this expected number calculated?
In the feature-map, how do you explain the fact that some discovered oligonucleotides overlap?
To finish interpreting the above results, read section Anticipated results, application 1, option A
Read Box 1, Box 2 and Box 3
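
When comparing observed and expected occurrences, the following sketch may help you read the result table. It assumes a binomial test on word occurrences, an E-value obtained by multiplying the P-value by the number of tested words, and a significance defined as -log10(E-value). The function names and all numbers are invented for illustration, and this is not the RSAT code itself.

```python
import math

def binomial_tail(n, p, x):
    """P(X >= x) for X ~ Binomial(n, p), with 0 < p < 1, summed in log space."""
    total = 0.0
    for i in range(x, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log(1 - p))
        total += math.exp(log_term)
    return min(total, 1.0)  # guard against rounding slightly above 1

def word_significance(observed, word_prob, positions, tested_words):
    """Return (expected occurrences, P-value, E-value, significance) for one word."""
    expected = word_prob * positions
    p_value = binomial_tail(positions, word_prob, observed)
    e_value = p_value * tested_words  # correction for multiple testing
    sig = -math.log10(e_value) if e_value > 0 else math.inf
    return expected, p_value, e_value, sig

# Toy numbers: a hexanucleotide with prior probability 1/4096 (equiprobable
# nucleotides), seen 25 times among 20000 positions, with 4096 words tested.
expected, p_value, e_value, sig = word_significance(
    observed=25, word_prob=1 / 4096, positions=20000, tested_words=4096)
print(round(expected, 2), p_value, e_value, round(sig, 2))
```

With these toy numbers the word is expected about 5 times but observed 25 times, so the significance is positive. A significance at or below 0 would mean that such an excess is expected at least once by chance among all the tested words.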

Discovering over-represented dyads (spaced motifs)

FNR represses genes involved in aerobic respiration and activates genes required for anaerobic respiration. You will discover a potential spaced motif in the promoters of 98 target genes of the factor FNR, in Escherichia coli K12.

Follow (again!) the procedure from step 1 to step 13 with the program dyad-analysis (option B).

Notice that you now work with Escherichia coli K12.

Analyzing the results
Did you find significant spaced motif(s)?
How were the dyads assembled to obtain the final motif?

To finish interpreting the above results, read section Anticipated results, application 1, option B
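
As an illustration of what a dyad is, the sketch below counts pairs of trinucleotides separated by a spacer of fixed length, for every spacing from 0 to 20. The sequences are invented, the function names are hypothetical, only one strand is scanned, and the n{spacing} notation is assumed here for display; the statistics applied by dyad-analysis are not reproduced.

```python
from collections import Counter

def count_dyads(sequences, monad_len=3, min_spacing=0, max_spacing=20):
    """Count (left word, spacing, right word) triples on one strand."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for spacing in range(min_spacing, max_spacing + 1):
            span = 2 * monad_len + spacing  # total length covered by the dyad
            for i in range(len(seq) - span + 1):
                left = seq[i:i + monad_len]
                right = seq[i + monad_len + spacing:i + span]
                counts[(left, spacing, right)] += 1
    return counts

def format_dyad(dyad):
    """Display a dyad as e.g. TTGn{6}CAA (notation assumed for this sketch)."""
    left, spacing, right = dyad
    return f"{left}n{{{spacing}}}{right}"

# Invented sequences that all contain TTGA, four variable bases, then TCAA.
toy_promoters = ["AATTGACCCCTCAAGG", "CTTGAGGGGTCAATA", "GGTTGATTTTTCAACC"]
dyads = count_dyads(toy_promoters)
for dyad in [("TTG", 6, "CAA"), ("TGA", 5, "CAA"), ("TTG", 5, "TCA")]:
    print(format_dyad(dyad), dyads[dyad])
```

Note that several mutually overlapping dyads are found in all three sequences. They are different views of the same underlying spaced motif, which is why a step that assembles dyads is needed to obtain a final motif.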

Retrieving sequences with RSAT

RSAT provides many utility tools, among which retrieve-seq, retrieve-ensembl-seq and fetch-sequences.
retrieve-seq: retrieve sequences relative to a reference position (e.g. sequences upstream of the TSS).
retrieve-ensembl-seq: same as retrieve-seq, but the source of sequences is the Ensembl database.
fetch-sequences: retrieve sequences from a set of genomic coordinates. Source of sequences = UCSC database.
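
The two kinds of retrieval described above (by genomic coordinates, or relative to a reference position) can be sketched as follows. This is only an illustration on a made-up genome string: the function names, the 1-based inclusive coordinate convention and the `tss` parameter are assumptions of this sketch, not the behaviour of the RSAT programs.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def fetch_region(genome, start, end, strand="+"):
    """Extract positions start..end (1-based, inclusive) on the given strand."""
    region = genome[start - 1:end]
    return reverse_complement(region) if strand == "-" else region

def upstream_region(genome, tss, strand, upstream_len):
    """Extract up to `upstream_len` bases located just upstream of position `tss`."""
    if strand == "+":
        start, end = max(tss - upstream_len, 1), tss - 1
    else:
        # On the minus strand, "upstream" lies at higher genomic coordinates.
        start, end = tss + 1, min(tss + upstream_len, len(genome))
    return fetch_region(genome, start, end, strand)

toy_genome = "AAACCCGGGTTT"  # invented 12 bp "genome"
print(fetch_region(toy_genome, 1, 3))            # coordinates 1 to 3
print(upstream_region(toy_genome, 7, "+", 3))    # 3 bp upstream of position 7
print(upstream_region(toy_genome, 6, "-", 3))    # same idea on the minus strand
```

The minus-strand case is the one most often handled incorrectly by hand: the region must be taken downstream in genomic coordinates and then reverse-complemented so that it reads 5' to 3' toward the gene.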

Use the retrieve-seq program to retrieve upstream sequences
Go to the “retrieve-seq” form and follow the tutorial (click the "TUTORIAL" link at the bottom of the form). The last section, "Exercices", is optional.

Morgane Thomas-Chollier - Ecole Normale Supérieure mthomas[at]biologie.ens.fr