Morgane Thomas-Chollier - Ecole Normale Supérieure
M2 ENS - September 2015
Session 1: Motif discovery
The goal of this first exercise is to learn how to predict potential regulatory motifs with no a priori knowledge about the regulating factor(s).
A typical situation is a group of co-expressed genes (e.g. from microarray or RNA-seq data), for which we want to identify a possible common regulatory motif. These approaches are also commonly used on binding regions detected genome-wide, such as ChIP-chip or ChIP-seq data.
You will use the Regulatory Sequence Analysis Tools (RSA-tools, commonly called "RSAT"). From this suite, you will use the programs:
oligo-analysis to discover over-represented words
dyad-analysis to discover over-represented dyads (i.e. spaced motifs)
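To make the notion of an "over-represented word" concrete, here is a minimal sketch in Python. It counts all overlapping 6-mers in a few toy sequences and scores the most frequent one with a binomial upper-tail probability. The sequences are hypothetical (they are not the Spo0A data set), and the background model is deliberately simplistic (all words equally likely); it is not meant to reproduce the statistics computed by oligo-analysis.

```python
from collections import Counter
from math import comb


def count_kmers(sequences, k):
    """Count every overlapping k-mer on the given strand of each sequence."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def binomial_upper_tail(observed, trials, prob):
    """Return P(X >= observed) for X ~ Binomial(trials, prob)."""
    return sum(
        comb(trials, x) * prob ** x * (1 - prob) ** (trials - x)
        for x in range(observed, trials + 1)
    )


# Toy input: hypothetical sequences, for illustration only
sequences = ["TTGTCGAATTTGTCGAA", "ACGTTTGTCGAAACGT", "GGCATTGTCGAACC"]
k = 6

counts = count_kmers(sequences, k)
total_positions = sum(counts.values())
word, observed = counts.most_common(1)[0]

# Assumed background: each of the 4**k words is equally likely (a simplification)
p_value = binomial_upper_tail(observed, total_positions, 1 / 4 ** k)
print(word, observed, p_value)
```

A real analysis would replace the equiprobable background with word frequencies estimated from a relevant set of sequences, which is exactly what the background model options of the web form control.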
You will also learn to estimate the rate of false positive predictions and use appropriate controls to evaluate your results.
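One reason false positives matter here is that thousands of words are tested at once. The sketch below illustrates the usual correction, under two assumptions that should be checked against the oligo-analysis manual: the E-value is taken as the nominal P-value multiplied by the number of tested words, and the significance score is taken as -log10(E-value). The number of tests (all 4^6 hexanucleotides) is also an assumption made for the example.

```python
from math import log10


def e_value(p_value, n_tests):
    """Expected number of false positives at this P-value over n_tests words."""
    return p_value * n_tests


def significance(p_value, n_tests):
    """Assumed significance score: -log10(E-value); positive means E-value < 1."""
    return -log10(e_value(p_value, n_tests))


# Assumption for the example: every possible hexanucleotide is tested once
n_words = 4 ** 6

for p in (1e-3, 1e-8):
    print(p, e_value(p, n_words), round(significance(p, n_words), 2))
```

With this convention, a nominal P-value that looks small on its own can still correspond to several expected false positives once all tested words are taken into account.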
Discovering over-represented oligonucleotides
You will discover a potential regulatory motif of the genes regulated by the transcription factor (TF) Spo0A, the main regulator of sporulation in the bacterium Bacillus subtilis.
The list of target genes was obtained from a ChIP-chip experiment.
You will use the RSAT Teaching server.
You will partly follow this protocol:
"Using RSAT oligo-analysis and dyad-analysis tools to discover regulatory signals in nucleic sequences" by Defrance et al., Nature Protocols (2008) (PDF)
All input files are located here (copying and pasting from the EQUIPMENT SETUP section of the PDF does not work well):
http://rsat.ulb.ac.be/rsat/data/published_data/nature_protocols/pattern_discovery/
- Read the Introduction. Study case 2 can be skipped, as this course does not cover phylogenetic footprints. Be sure not to skip the section "Other applications of this protocol"!
- Follow steps 1 to 13 of the procedure with the program oligo-analysis (option A)
Analyzing the results
How many sequences were used? Tip: look at the information above the table containing the discovered words.
If we are looking for over-represented oligomers of size k, what is the maximum order of the background Markov model?
Look at the top result. How many times was this oligonucleotide found in the input set? How many times was it expected? How was this expected number calculated?
In the feature map, how do you explain the fact that some of the discovered oligonucleotides overlap?
- To finish interpreting the above results, read section Anticipated results, application 1, option A
- Read Box 1, Box 2 and Box 3
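As a complement to the reading on background models, the following sketch shows one way to obtain an expected number of occurrences from a Markov chain of a given order. It assumes that the transition probabilities are estimated directly from a set of background sequences supplied by the user. The function names and the toy background are hypothetical, and the estimation is simplified compared to what RSAT does.

```python
from collections import Counter


def markov_word_probability(word, background, order):
    """Probability of `word` under a Markov chain of the given order,
    with parameters estimated from the `background` sequences."""
    # Count all overlapping (order + 1)-mers in the background
    counts = Counter(
        seq[i:i + order + 1]
        for seq in background
        for i in range(len(seq) - order)
    )
    total = sum(counts.values())

    # Count how often each context (the first `order` letters) is observed
    context_counts = Counter()
    for kmer, n in counts.items():
        context_counts[kmer[:order]] += n

    # Probability of the first `order` letters of the word
    prob = context_counts[word[:order]] / total

    # Multiply by the transition probability of each following letter
    for i in range(order, len(word)):
        context = word[i - order:i]
        if context_counts[context] == 0:
            return 0.0
        prob *= counts[context + word[i]] / context_counts[context]
    return prob


def expected_occurrences(word, background, order, n_positions):
    """Expected count = word probability x number of scanned positions."""
    return markov_word_probability(word, background, order) * n_positions


# Toy background (hypothetical): a strongly periodic sequence
background = ["ACGT" * 25]
p0 = markov_word_probability("ACGT", background, order=0)
p1 = markov_word_probability("ACGT", background, order=1)
print(p0, p1, expected_occurrences("ACGT", background, 1, 1000))
```

On this toy background, the order-0 model only sees the nucleotide composition, whereas the order-1 model also captures the dependency between adjacent letters, so the two models give very different expectations for the same word.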
Discovering over-represented dyads (spaced motifs)
FNR represses genes involved in aerobic respiration and activates genes required for anaerobic respiration. You will discover a potential spaced motif in the promoters of 98 target genes of the factor FNR, in Escherichia coli K12.
- Follow (again!) steps 1 to 13 of the procedure, this time with the program dyad-analysis (option B)
Note that you are now working with Escherichia coli K12.
Analyzing the results
Did you find any significant spaced motif(s)?
How were the dyads assembled to obtain the final motif?
- To finish interpreting the above results, read section Anticipated results, application 1, option B
Retrieving sequences with RSAT
RSAT provides many utility tools, including retrieve-seq, retrieve-ensembl-seq and fetch-sequences.
retrieve-seq: retrieves sequences relative to a reference position (e.g. sequences upstream of the TSS).
retrieve-ensembl-seq: same as retrieve-seq, but the sequences come from the Ensembl database.
fetch-sequences: retrieves sequences from a set of genomic coordinates. The sequences come from the UCSC database.
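The idea of retrieving sequences "relative to a reference" can be sketched as follows. The example assumes that the genome is available as a plain string, that genes are described by 1-based inclusive coordinates and a strand, and that the reference position is the annotated gene start. The function names and the default limits are hypothetical; this is not the implementation of retrieve-seq.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_sequence(genome, start, end, strand, up_from=-400, up_to=-1):
    """Extract the region [up_from, up_to] relative to the gene start.

    start/end: 1-based inclusive coordinates of the gene on the + strand.
    Position -1 is the base immediately upstream of the gene start.
    """
    if strand == "+":
        lo = max(0, start + up_from - 1)   # 0-based, inclusive
        hi = max(0, start + up_to)         # 0-based, exclusive
        return genome[lo:hi]
    # On the - strand, the gene start is `end` and upstream lies to the right
    lo = min(len(genome), end - up_to - 1)
    hi = min(len(genome), end - up_from)
    return reverse_complement(genome[lo:hi])


# Toy genome (hypothetical), with one gene on each strand
genome = "ATGCCGTAAGCT"
plus_up = upstream_sequence(genome, start=7, end=9, strand="+", up_from=-3)
minus_up = upstream_sequence(genome, start=4, end=6, strand="-", up_from=-3)
print(plus_up, minus_up)
```

The key point is that the sequence of a gene on the reverse strand must be taken to the right of the gene on the reference strand and then reverse-complemented, so that all retrieved sequences are oriented in the same way relative to their gene.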
- Use the retrieve-seq program to retrieve upstream sequences
- Go to the "retrieve-seq" form and follow the tutorial (follow the "TUTORIAL" link at the bottom of the form). The last "Exercises" section is optional.
mthomas[at]biologie.ens.fr