You will scan the upstream region of the gene even-skipped (eve) from the Drosophila melanogaster genome with 12 matrices, representing the binding specificities of 12 factors that are known to regulate this gene.
The aim is to locate the putative binding sites for these 12 factors.
You will follow in part this protocol:
"Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules" by Turatsinze, Thomas-Chollier et al, Nature Protocols (2008)
- Open the article, read the Introduction until "Procedure"
- Read Box 1 and Box 2
matrix-scan allows a wide range of analyses, but is slow. For the prediction of TFBS, RSAT now offers a faster program called matrix-scan-quick that you will use today.
- In the Pattern matching menu, select matrix-scan (quick)
- Fill in the sequence field with the upstream sequence (5500 bp) of the eve gene
- The Matrix section allows specifying the transcription factor-binding motif(s).
For the even-skipped case study, the file is in Transfac format. Copy and paste the file with the 12 matrices into the Matrix box. In the Matrix format menu, select 'transfac'.
- The next section of the form provides several options for specifying the background model (the statistical model for sequences that are not instances of the motif). The choice of background model crucially affects the results. For a first analysis, select Markov chain order 0.
- Check the option organism-specific, and select Drosophila melanogaster and upstream-noorf.
- The section Scanning options determines the scanning mode and the parameters to return. The selector Origin specifies whether the origin for reporting coordinates should be the end or the start of the sequences. By default, the end is considered as the origin, so that the hits are reported with negative coordinates for upstream sequences.
- Select return sites + pval to compute the P-value associated with each predicted site.
- Defining a threshold on the P-value is the preferred approach. Set the value to 1e-4.
- Click GO.
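To make the role of the matrix and the background model more concrete, here is a minimal sketch of what scanning a sequence with one matrix and an order-0 background amounts to: each window is scored with the log-ratio between its probability under the motif and under the background. This is an illustration only, not the RSAT implementation. The toy count matrix, the uniform background frequencies, the pseudocount handling and the function names are all assumptions made for the example, and only one strand is scanned.

```python
import math

def weight_score(site, counts, background, pseudocount=1.0):
    """Log-likelihood ratio of `site`: motif model vs. order-0 background."""
    score = 0.0
    for position, base in enumerate(site):
        column = counts[position]
        total = sum(column.values()) + pseudocount
        # Assumption: the pseudocount is spread according to background frequencies.
        p_motif = (column[base] + pseudocount * background[base]) / total
        score += math.log(p_motif / background[base])
    return score

def scan(sequence, counts, background):
    """Score every window of the matrix width along one strand."""
    width = len(counts)
    return [(start, weight_score(sequence[start:start + width], counts, background))
            for start in range(len(sequence) - width + 1)]

# Hypothetical 3-column count matrix (not one of the 12 eve matrices).
toy_counts = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 9, "T": 0},
]
# Hypothetical order-0 background: equiprobable nucleotides.
toy_background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

hits = scan("TTACGTT", toy_counts, toy_background)
best_start, best_score = max(hits, key=lambda hit: hit[1])
```

A window that resembles the motif gets a positive score, while windows that look like background get a negative one. The P-value of a score is then the probability of reaching at least that score at a given position in background sequences.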
Analyzing the results
How many predictions do you obtain for these factors?
Given the P-value threshold, how many false positives do you expect in this region?
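As a hint for the second question: the P-value threshold is the probability of a false positive at each scanned position, so the expected number of false positives is roughly the number of scored positions multiplied by the threshold. The sketch below spells out this arithmetic. It assumes that both strands are scanned and that every matrix is scored at every valid position. The function name is hypothetical, and the matrix widths in the example are placeholders (10 columns each), not the real widths of the 12 eve matrices.

```python
def expected_false_positives(seq_length, p_value, matrix_widths, both_strands=True):
    """Expected number of hits by chance alone at a given P-value threshold."""
    strands = 2 if both_strands else 1
    # A matrix of width w can be placed at (seq_length - w + 1) positions per strand.
    positions = sum(max(seq_length - width + 1, 0) for width in matrix_widths)
    return positions * strands * p_value

# Placeholder widths: 12 matrices of 10 columns each (assumption, for illustration).
placeholder_widths = [10] * 12
expected = expected_false_positives(5500, 1e-4, placeholder_widths)
```

With these placeholder widths, the estimate comes out at about 13 for the whole region. Passing another threshold as `p_value` shows how the expectation scales with the cutoff.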
- Then produce the feature map. For the option color file, provide this file containing RGB color codes. It ensures that the factors always have the same colors on the graphs.
- Save the figure
Analyzing the results
Compare your figure with Figure 8a (top), which presents the annotated TFBS. Do your predictions seem correct?
Are you missing some sites (= false negatives) or finding additional ones (= potential false positives)?
- Rerun the analysis (= redo steps 3-13 above), relaxing the P-value threshold to 1e-3 (a less stringent cutoff)
Analyzing the results
How many predictions do you obtain for these factors?
Given the P-value threshold, how many false positives do you expect in this region?
Compare your results with Figure 8a again. Do your predictions seem better or worse? What is the possible drawback of setting a loose threshold?
Estimating the number of false positives
You will estimate the number of false positives by re-running exactly the same analysis, but using as input random sequences generated according to a realistic background model.
In this random control, all the matches you obtain are false positives (i.e. sites that are predicted to be TFBS but cannot be, since you are working with artificial, non-biological sequences).
Start by generating the random sequences.
- In the Build control sets menu, select random sequences
- Sequence length: 5500 (the same length as the eve upstream region)
- Number of sequences: 10
- In the Background section, select the background model that will be used to generate the random sequences. Choose Drosophila melanogaster for the organism. Keep the oligonucleotide size at 6 to use a Markov background model of order 5 (read Box 3 for details on Markov models).
- Click on Go
- Save the generated sequence file
- Redo steps 3-13 above (only for threshold 1e-4), using this set of random sequences as input (step 4)
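To illustrate what generating random sequences from a Markov background model means, the sketch below estimates transition counts from a training sequence and then draws each new nucleotide conditionally on the preceding `order` nucleotides (order 5 corresponds to an oligonucleotide size of 6). It is a simplified illustration, not the RSAT random-sequence tool: the function names, the toy training sequence, and the choices of starting from a random observed prefix and falling back to a uniform draw for unseen prefixes are all assumptions.

```python
import random
from collections import Counter, defaultdict

def train_markov(training_seq, order):
    """Count which nucleotide follows each prefix of length `order`."""
    counts = defaultdict(Counter)
    for i in range(len(training_seq) - order):
        prefix = training_seq[i:i + order]
        counts[prefix][training_seq[i + order]] += 1
    return counts

def generate_sequence(counts, order, length, rng):
    """Draw `length` nucleotides, each conditioned on the previous `order` ones."""
    # Simplification: start from a randomly chosen prefix seen in training.
    seq = list(rng.choice(sorted(counts)))
    while len(seq) < length:
        prefix = "".join(seq[len(seq) - order:]) if order > 0 else ""
        followers = counts.get(prefix)
        if not followers:
            # Prefix never seen in training: fall back to a uniform draw.
            seq.append(rng.choice("ACGT"))
            continue
        letters = sorted(followers)
        weights = [followers[letter] for letter in letters]
        seq.append(rng.choices(letters, weights=weights)[0])
    return "".join(seq[:length])

# Toy training sequence (hypothetical); a real model would be trained on
# a large set of Drosophila melanogaster sequences.
toy_training = "ACGTACGTTGCAAGCTAGCTAGGATCCATTTAAACGCG" * 20
rng = random.Random(0)  # fixed seed so the control set is reproducible
model = train_markov(toy_training, order=5)
control_set = [generate_sequence(model, 5, 5500, rng) for _ in range(10)]
```

Because such sequences preserve the oligonucleotide composition of the training set but carry no biological signal, every site predicted in them can be counted as a false positive.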
Analyzing the results
How many predictions do you obtain for these factors?
Does that correspond to the expected number of false positives?
The protocol also presents random controls in the form of randomly selected regions, which are biological sequences instead of artificial ones. Other controls like permuting the matrices are also common in research projects.
If you are interested in going further, the protocol also presents the search for regulatory modules (CRERs).