Morgane Thomas-Chollier - Ecole Normale Supérieure        M2 ENS - September 2014

Exercise 3: Matrix-based pattern matching

The goal of the second and third exercises is to learn how to search for a known motif within a DNA sequence of interest.
This technique is applied here to predict transcription factor binding sites (TFBS), but it can also be used to search for other biological signals such as exon/intron boundaries or restriction sites.
From the RSA-tools suite, you will use the programs:
dna-pattern with motifs described as consensus sequences => Exercise 2
matrix-scan with motifs described as matrices => Exercise 3

You will also learn to estimate the rate of false positive predictions and use appropriate controls to evaluate your results.

Search for a motif described as a matrix

You will scan the upstream region of the gene even-skipped (eve) from the Drosophila melanogaster genome with 12 matrices, representing the binding specificity of 12 factors that are known to regulate this gene. The aim is to locate the putative binding sites for these 12 factors.
You will partly follow this protocol: "Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules" by Turatsinze, Thomas-Chollier et al., Nature Protocols (2008) (PDF)
  1. Open the article, read the Introduction until "Procedure"
  2. Read Box 1 and Box 2
  3. matrix-scan allows a wide range of analyses, but is slow. For the prediction of TFBS, RSAT now offers a faster program called matrix-scan-quick that you will use today.

  4. In the Pattern matching menu, select matrix-scan (quick)
  5. Fill in the sequence field by providing the upstream sequence (5500 bp) of the eve gene.
  6. The Matrix section allows specifying the transcription factor-binding motif(s). For the even-skipped study case, the file is in Transfac format. Copy/paste the file with the 12 matrices in the Matrix box. In the menu Matrix format, select 'transfac'.
  7. The next section of the form provides several options for specifying the background model (the statistical model for the sequences that do not correspond to instances of the motif). The choice of the background model crucially affects the results. For a first analysis, select Markov chain order 0.
  8. Check the option organism-specific, and select Drosophila melanogaster and upstream-noorf.
  9. The section Scanning options determines the scanning mode and the parameters to return. The selector Origin specifies whether the origin for reporting coordinates should be the end or the start of the sequences. By default, the end is considered as the origin, so that the hits are reported with negative coordinates for upstream sequences.
  10. Select return sites + pval to compute the P-value associated with each predicted site.
  11. Defining a threshold on the P-value is the preferred approach. Set the value to 1e-4.
  12. Click GO.
  13. Analyzing the results
    How many predictions do you obtain for these factors? Given the P-value threshold, how many false positives do you expect in this region?

  14. Then produce the feature map. For the option color file, provide this file containing RGB color codes. It ensures that the factors always have the same colors on the graphs.
  15. Save the figure
  16. Analyzing the results
    Compare your figure with Figure 8a (top) presenting the annotated TFBS. Do your predictions seem correct?
    Are you missing some sites (= false negatives) or finding additional ones (= potential false positives)?

  17. Rerun the analysis (= redo steps 3-13 above) with the threshold relaxed to 1e-3.
  18. Analyzing the results
    How many predictions do you obtain for these factors?
    Given the P-value threshold, how many false positives do you expect in this region?
    Compare your results with Figure 8a again. Do your predictions seem better or worse? What is the possible drawback of setting a loose threshold?
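
As a rough guide for the "how many false positives do you expect" questions, the expected number of chance matches can be approximated as the P-value threshold multiplied by the number of scored positions. The sketch below is only an illustration under stated assumptions: it assumes that both strands are scanned and that every matrix is 10 columns wide. The function name and the widths are hypothetical; read the real widths from the Transfac file.

```python
def expected_false_positives(seq_length, matrix_widths, pvalue, strands=2):
    """Approximate number of chance matches at a given P-value threshold.

    Each matrix of width w is scored at (seq_length - w + 1) positions
    per strand, and each scored position passes the threshold by chance
    with probability `pvalue`.
    """
    total = 0.0
    for width in matrix_widths:
        positions = max(seq_length - width + 1, 0)  # windows that fit in the sequence
        total += pvalue * positions * strands
    return total


# ASSUMPTION: 12 matrices, all 10 columns wide (placeholder values).
assumed_widths = [10] * 12

for threshold in (1e-4, 1e-3):
    n = expected_false_positives(5500, assumed_widths, threshold)
    print(f"P-value <= {threshold:g}: about {n:.1f} false positives expected")
```

With these assumed widths, the estimate is about 13 chance matches at 1e-4 and ten times more at 1e-3, which gives an order of magnitude to compare with the number of sites you actually obtained.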


    Estimating the amount of false positives

    You will estimate the amount of false positives by re-running the exact same analysis, but using as input random sequences generated according to a realistic background model.
    In this random control case, all the matches you obtain are false positives (i.e. sites that are predicted to be TFBS but cannot be, since you are working with artificial, non-biological sequences).
    Start by generating the random sequences.

  19. In the Build control sets menu, select random sequences
  20. Sequence length: 5500 (same length as the eve upstream region)
  21. Number of sequences: 10
  22. In the Background section, select the background model that will be used to generate the random sequences. Choose Drosophila melanogaster for the organism. Keep the oligonucleotide size at 6 to use a Markov background model of order 5 (read Box 3 for details on Markov models).
  23. Click on Go
  24. Save the generated sequence file
  25. Redo steps 3-13 above (only for threshold 1e-4), using this set of random sequences as input (step 5)
  26. Analyzing the results
    How many predictions do you obtain for these factors?
    Does that correspond to the expected number of false positives?
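
To make the link between oligonucleotide size 6 and a Markov model of order 5 concrete, here is a minimal sketch of how random sequences can be drawn from such a model: each new letter depends on the 5 preceding ones, and the probabilities come from counts of 6-letter words. This is not the RSAT implementation. The function names are hypothetical, and the training sequence is a toy placeholder, whereas the web form uses frequencies calibrated on Drosophila melanogaster sequences.

```python
import random
from collections import defaultdict


def train_markov(training_seq, order):
    """Count (order+1)-letter words: a prefix of `order` letters plus the next letter."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(training_seq) - order):
        prefix = training_seq[i:i + order]
        counts[prefix][training_seq[i + order]] += 1
    return counts


def generate(counts, order, length, rng):
    """Draw a sequence in which each letter depends on the `order` preceding letters."""
    seq = list(rng.choice(sorted(counts)))  # start from a prefix seen in training
    while len(seq) < length:
        prefix = "".join(seq[-order:]) if order else ""
        next_counts = counts.get(prefix)
        if not next_counts:  # prefix never seen in training: uniform fallback
            seq.append(rng.choice("ACGT"))
            continue
        letters = sorted(next_counts)
        weights = [next_counts[letter] for letter in letters]
        seq.append(rng.choices(letters, weights=weights)[0])
    return "".join(seq[:length])


rng = random.Random(0)
# ASSUMPTION: toy training data, NOT real Drosophila sequences.
toy_training = "".join(rng.choice("ACGT") for _ in range(20000))
model = train_markov(toy_training, order=5)

# Same settings as in the exercise: 10 sequences of 5500 bp.
control_set = [generate(model, 5, 5500, rng) for _ in range(10)]
print(len(control_set), "sequences of length", len(control_set[0]))
```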

    The protocol also presents random controls in the form of randomly selected regions, which are biological sequences instead of artificial ones. Other controls, such as permuting the matrices, are also common in research projects. If you are interested in going further, the protocol also presents the search for regulatory modules (CRERs).
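
One simple way to permute a matrix is to shuffle its columns: the permuted matrix keeps the same width and the same per-column composition, but no longer represents the original motif, so its matches can serve as a negative control. The sketch below is only an illustration. The function name and the 4-column count matrix are made up for the example and do not correspond to any of the 12 factors.

```python
import random


def permute_columns(matrix, rng):
    """Return a copy of a count matrix with its columns in random order.

    `matrix` is a list of columns; each column maps A, C, G, T to a count.
    """
    shuffled = [dict(column) for column in matrix]  # copy, leave the input intact
    rng.shuffle(shuffled)
    return shuffled


# ASSUMPTION: toy matrix for illustration only.
toy_matrix = [
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 8, "G": 1, "T": 1},
    {"A": 2, "C": 2, "G": 3, "T": 3},
    {"A": 0, "C": 0, "G": 0, "T": 10},
]

permuted = permute_columns(toy_matrix, random.Random(42))
for column in permuted:
    print(column)
```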

Morgane Thomas-Chollier - Ecole Normale Supérieure        mthomas[at]biologie.ens.fr