Morgane Thomas-Chollier - Ecole Normale Supérieure M1 ENS - October 2015

Exercise 2: String-based pattern matching

The goal of the second and third exercises is to learn how to search for a known motif within a DNA sequence of interest.
This technique is here applied to predict transcription factor binding sites (TFBS), but it may be applied to search for other biological signals such as exon/intron boundaries, restriction sites...
From the RSA-tools suite, you will use the programs:
dna-pattern with motifs described as consensus sequences => Exercise 2
matrix-scan with motifs described as matrices => Exercise 3

You will also learn to estimate the rate of false positive predictions and use appropriate controls to evaluate your results.

Search for a motif described as a consensus sequence

You have the upstream regions (800 bp each) of a selection of nitrogen-responsive yeast genes. You will search these regions for the positions of putative GATA boxes and Hap sites.
This exercise is adapted from the tutorial of the program dna-pattern, accessible from the RSAT website at the bottom of the tool form.

In the Pattern matching menu, select dna-pattern
In the Query pattern(s) box, you will enter the patterns to be searched for. Each pattern must come on a separate line. The first word of each line is the string description of the pattern, the second word is an identifier for this pattern. Type the following text in the Query pattern(s) box:
```
GATAAG	  Gata_box
CCAAY	  Hap_site
```
Note the use of the degenerate IUPAC code: the Y in CCAAY on the second line means "either C or T".
For the sequences, download this sequence file, then select it from your computer in the sequence section.
Leave all other parameters unchanged and click GO.
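
To make the degenerate code concrete, here is a minimal Python sketch that translates an IUPAC string into a regular expression. It is only an illustration of what "Y means either C or T" implies for matching; the function name iupac_to_regex is hypothetical and this is not how dna-pattern is implemented.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_to_regex(pattern):
    """Translate an IUPAC string such as CCAAY into a regular expression.

    Hypothetical helper written for this exercise.
    """
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC[symbol]
        # A degenerate symbol becomes a character class, e.g. Y -> [CT].
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

hap_regex = iupac_to_regex("CCAAY")
print(hap_regex)                                # CCAA[CT]
print(bool(re.fullmatch(hap_regex, "CCAAT")))   # True
print(bool(re.fullmatch(hap_regex, "CCAAG")))   # False
```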

Analyzing the results
You see now the positions of all matches with the patterns you entered within the upstream sequences of the selected genes. Each line shows a single match, and the different columns indicate respectively:
- pattern identifier
- strand on which the match was found (D for direct, R for Reverse)
- pattern searched for (i.e. the query strings you provided)
- name of the sequence in which it was found
- starting position of the match
- end position of the match
- match sequence. The matching bases are shown in uppercase. The 4 flanking bases on the left and right are shown in lowercase.
- matching score. In this case all scores equal 1
Notice that positions are returned in negative coordinates, relative to the end of the sequence (the last nucleotide has position -1). This behaviour was selected with the "Origin" option in the dna-pattern form (Origin=end). This option is particularly useful for analyzing regulatory sequences, but it can be inactivated in other cases.
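
The following Python sketch shows how matches on both strands and end-relative coordinates can be computed. It is a simplified illustration, not the dna-pattern implementation: the function names are hypothetical, flanking bases and scores are omitted, and the sketch assumes that a match on the reverse strand is found by searching the reverse complement of the pattern and is reported with its direct-strand coordinates and direct-strand text.

```python
import re

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Complement table that also covers the degenerate symbols (R <-> Y, K <-> M, ...).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]

def to_regex(pattern):
    return "".join("[" + IUPAC[s] + "]" for s in pattern.upper())

def scan(seq_name, seq, pattern, pattern_id):
    """Return one row per match: id, strand, pattern, sequence name, start, end, match."""
    seq = seq.upper()
    length = len(seq)
    rows = []
    for strand, query in (("D", pattern), ("R", reverse_complement(pattern))):
        # The lookahead reports overlapping matches as well.
        regex = re.compile("(?=(" + to_regex(query) + "))")
        for m in regex.finditer(seq):
            first = m.start()                  # 0-based index of the first matching base
            last = first + len(pattern) - 1    # 0-based index of the last matching base
            # Origin=end: the last nucleotide of the sequence has position -1.
            rows.append((pattern_id, strand, pattern, seq_name,
                         first - length, last - length, m.group(1)))
    return rows

for row in scan("toy_seq", "TTGATAAGCCCTTATCAA", "GATAAG", "Gata_box"):
    print(*row, sep="\t")
```

On this 18 bp toy sequence, the sketch reports one match on the direct strand at positions -16 to -11 and one on the reverse strand at positions -8 to -3.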

You will now display the same results graphically, on a feature map.

Click on the Feature map button on the bottom of the result page. The results from the previous page have been automatically transferred to this form.

In the Title box, type

Gata boxes and Hap sites in the upstream regions of NIT genes

Leave other parameters unchanged and click GO.
Save the image on your computer.

Analyzing the results
After a few seconds, the feature map should appear. A few comments:
- Gata boxes appear in blue, Hap sites in red
- Colored boxes are displayed either above or below the horizontal black lines, according to the strand of the match.

Estimating the amount of false positives

You will estimate the amount of false positives by re-running the exact same analysis, but using as input random sequences generated according to a realistic background model.
In this random control, all the matches you obtain are false positives (i.e. sites that are predicted to be TFBS but cannot be, since you are working with artificial, non-biological sequences).
Start by generating the random sequences.

In the Build control sets menu, select random sequences
In the template section, upload your original sequence file. It will serve as a template to generate the same number of sequences, with exactly the same lengths as those in your original file.
Keep Saccharomyces cerevisiae as background model
Click on Go
Save the generated sequence file
Repeat the dna-pattern search and the feature map steps above, using this set of random sequences as input in place of the original sequence file.
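
The idea of a template-based random control can be sketched in a few lines of Python. This is a deliberately simplified stand-in for the RSAT tool: it draws each nucleotide independently, whereas the yeast background model offered on the website is more elaborate. The function name and the frequency values below are illustrative placeholders, not figures taken from RSAT.

```python
import random

def random_control(template_seqs, freqs, seed=0):
    """Generate one random sequence per template sequence, with the same length.

    Hypothetical helper: nucleotides are drawn independently according to freqs.
    """
    rng = random.Random(seed)          # fixed seed so the control is reproducible
    bases = list(freqs)
    weights = [freqs[b] for b in bases]
    return ["".join(rng.choices(bases, weights=weights, k=len(s)))
            for s in template_seqs]

# Placeholder AT-rich frequencies, assumed here for illustration only.
assumed_freqs = {"A": 0.31, "C": 0.19, "G": 0.19, "T": 0.31}

templates = ["ACGT" * 200, "TTGATAAGCC" * 80]   # two toy sequences of 800 bp
controls = random_control(templates, assumed_freqs)
print([len(s) for s in controls])               # [800, 800]
```

Because the control keeps the number and lengths of the original sequences, the match counts obtained on both sets can be compared directly.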

Analyzing the results
You should obtain a feature map with the random sequences.
Do you see false positives (i.e. spurious matches) in these random sequences?
How many matches do you have with the original sequences? How many with the random sequences?
How confident are you when looking again at your original results? Do you think all the matches are correct predictions?

The control shows that a large number of the predictions are actually false positives! Keep in mind that spurious matches are expected to be found in any sequence, just by chance.
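
The number of matches expected by chance can be roughly estimated with a short calculation. The Python sketch below assumes independent, equiprobable nucleotides (each base has probability 1/4), which is a simplification since real yeast upstream regions are not equiprobable. It also assumes the search is done on both strands and that the pattern differs from its reverse complement, which is the case for GATAAG and CCAAY. The function name is hypothetical.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expected_matches(pattern, seq_length, n_seqs=1, both_strands=True):
    """Expected number of chance matches under an equiprobable nucleotide model."""
    p = 1.0
    for symbol in pattern.upper():
        # A symbol standing for k bases matches with probability k/4.
        p *= len(IUPAC[symbol]) / 4
    positions = seq_length - len(pattern) + 1   # possible start positions per strand
    strands = 2 if both_strands else 1
    return n_seqs * strands * positions * p

print(round(expected_matches("GATAAG", 800), 3))  # 0.388
print(round(expected_matches("CCAAY", 800), 3))   # 3.109
```

Under these assumptions, a single 800 bp sequence is expected to contain about 0.4 GATAAG matches and about 3 CCAAY matches by chance alone, which illustrates why short or degenerate patterns produce many spurious predictions.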

Morgane Thomas-Chollier - Ecole Normale Supérieure mthomas[at]biologie.ens.fr