Morgane Thomas-Chollier - Ecole Normale Supérieure
M1 ENS - October 2015
Exercise 1: Motif descriptors
The goal of this first exercise is to manipulate two kinds of motif descriptors: consensus sequences and count matrices.
You will use the program convert-matrix from the RSA-tools suite. This tool lets you:
- Perform interconversions between various matrix formats
- Produce consensus and regular expression descriptors
- Calculate various statistics on the PSSMs
- Reverse PSSMs
- Permute PSSMs
- Construct logos
As input data, you will construct a matrix from a multiple alignment and fetch a count matrix from the JASPAR database.
The matrices describe the binding motifs of two transcription factors seen in the course (Meis and Gcn4).
Constructing a personal matrix
- Go to the RSAT teaching server
- In the Matrix tool menu, select convert matrix
You will construct a count matrix for the factor Meis from a multiple alignment of transcription factor binding sites (TFBS) extracted from various vertebrate genomes. The alignment is in FASTA format.
Note that the tool can convert between a wide range of formats.
- Copy the following alignment into the matrix box, and select sequences as the input format
>1
TGACAA
>2
TGACAG
>3
TGATGG
>4
TGACAA
>5
TGGCAG
>6
TGATTG
>7
TGACAG
>8
TGACAG
- The background model is not used in this exercise; you can leave the default option
- Click on Go to run the program with default parameters
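To see what the tool computes from the alignment, the counting step can be reproduced by hand. The following is a minimal sketch in Python, not the convert-matrix implementation: the names SITES and count_matrix are hypothetical, and the printed layout (one row per residue) is only an approximation of the tab output.

```python
from collections import Counter

# The eight aligned Meis sites given in the exercise.
SITES = [
    "TGACAA", "TGACAG", "TGATGG", "TGACAA",
    "TGGCAG", "TGATTG", "TGACAG", "TGACAG",
]


def count_matrix(sites, alphabet="ACGT"):
    """Return {residue: [count at each position]} for equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    # One Counter per alignment column; missing residues count as 0.
    columns = [Counter(site[i] for site in sites) for i in range(width)]
    return {residue: [column[residue] for column in columns] for residue in alphabet}


matrix = count_matrix(SITES)
for residue, counts in matrix.items():
    print(residue, "|", "\t".join(str(c) for c in counts))
```

Each column of the resulting matrix sums to 8, the number of aligned sites.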
Questions
Compare the computed matrix with the one you made manually during the course.
Look at the consensus sequence under the matrix. Is it strict or degenerate? Compare it with the one you made manually during the course.
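The difference between a strict and a degenerate consensus can be illustrated with a short Python sketch. It assumes a simple frequency rule (keep every residue present in at least a given fraction of the sites, then map the set to its IUPAC code). This rule and the names COUNTS, strict_consensus and degenerate_consensus are illustrative assumptions; convert-matrix may use a different criterion and return a different string.

```python
# Count matrix of the eight Meis sites (one list per residue, one value per position).
COUNTS = {
    "A": [0, 0, 7, 0, 6, 2],
    "C": [0, 0, 0, 6, 0, 0],
    "G": [0, 8, 1, 0, 1, 6],
    "T": [8, 0, 0, 2, 1, 0],
}

# IUPAC nucleotide codes, keyed by the sorted set of residues they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def strict_consensus(counts):
    """Most frequent residue at each position (ties go to the first of A, C, G, T)."""
    width = len(counts["A"])
    return "".join(max("ACGT", key=lambda r: counts[r][i]) for i in range(width))


def degenerate_consensus(counts, min_fraction=0.25):
    """IUPAC code covering every residue seen in at least min_fraction of the sites."""
    width = len(counts["A"])
    letters = []
    for i in range(width):
        total = sum(counts[r][i] for r in "ACGT")
        kept = "".join(r for r in "ACGT" if counts[r][i] / total >= min_fraction)
        if not kept:  # threshold too strict: fall back to the most frequent residue
            kept = max("ACGT", key=lambda r: counts[r][i])
        letters.append(IUPAC[kept])
    return "".join(letters)


strict = strict_consensus(COUNTS)
degenerate = degenerate_consensus(COUNTS)
print(strict, degenerate)
```

Lowering min_fraction makes the degenerate consensus more permissive, since rarer residues are then included in the IUPAC code.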
Have a look at the logo and note how the height differs from column to column.
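The column heights of a sequence logo reflect the information content of each position. As a rough sketch, assuming equiprobable residues and no small-sample correction (so the values may differ from the logo drawn by the tool), the height of a column is 2 minus its Shannon entropy, in bits. The names COUNTS and information_content are hypothetical.

```python
import math

# Count matrix of the eight Meis sites (one list per residue, one value per position).
COUNTS = {
    "A": [0, 0, 7, 0, 6, 2],
    "C": [0, 0, 0, 6, 0, 0],
    "G": [0, 8, 1, 0, 1, 6],
    "T": [8, 0, 0, 2, 1, 0],
}


def information_content(counts):
    """Per-position information content in bits: 2 - Shannon entropy of the column."""
    width = len(counts["A"])
    heights = []
    for i in range(width):
        column = [counts[r][i] for r in "ACGT"]
        total = sum(column)
        # Residues with a zero count contribute nothing to the entropy.
        entropy = -sum((c / total) * math.log2(c / total) for c in column if c > 0)
        heights.append(2.0 - entropy)
    return heights


heights = information_content(COUNTS)
for position, bits in enumerate(heights, start=1):
    print(f"position {position}: {bits:.2f} bits")
```

Fully conserved positions reach the maximum of 2 bits, while positions where several residues occur get shorter columns.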
- Go back to the previous page and rerun the program, choosing transfac as the output format
Questions
The transfac format is very different from the tab format used before. What is the main difference?
This is the format used by the TRANSFAC database. Many bioinformatics tools use it, and it has the advantage of integrating a name and an identifier (ID and AC fields) within the matrix format.
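To make the role of the AC and ID fields concrete, here is a minimal Python sketch that writes a count matrix in a TRANSFAC-like layout. The function to_transfac, the accession "M_meis" and the identifier "Meis" are hypothetical, and the exact spacing and extra fields written by convert-matrix may differ.

```python
# Count matrix of the eight Meis sites (one list per residue, one value per position).
COUNTS = {
    "A": [0, 0, 7, 0, 6, 2],
    "C": [0, 0, 0, 6, 0, 0],
    "G": [0, 8, 1, 0, 1, 6],
    "T": [8, 0, 0, 2, 1, 0],
}


def to_transfac(counts, accession, identifier):
    """Format a count matrix as a TRANSFAC-like record (one row per position)."""
    width = len(counts["A"])
    lines = [
        f"AC  {accession}",   # accession field
        "XX",                 # field separator
        f"ID  {identifier}",  # identifier (name) field
        "XX",
        "P0  " + "".join(f"{r:>6}" for r in "ACGT"),  # header of the count table
    ]
    for i in range(width):
        row = "".join(f"{counts[r][i]:>6}" for r in "ACGT")
        lines.append(f"{i + 1:02d}  {row}")
    lines += ["XX", "//"]     # "//" ends the record
    return "\n".join(lines)


text = to_transfac(COUNTS, accession="M_meis", identifier="Meis")
print(text)
```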
Obtaining a matrix from a database
You will retrieve the Gcn4 count matrix from the JASPAR database.
- Go to the Jaspar database website
- Search for the Gcn4 factor by name
- Click on the logo to get more details
Questions
To which family does this transcription factor belong? From which organism was the matrix built?
The logo does not look very "nice". You will compute a logo in PDF format, suitable for publications.
- Copy/paste the matrix into convert matrix
- Keep the input format as tab, not jaspar!
- Run the program to produce a logo; clicking on it lets you download the PDF file
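For readers who want to check a pasted matrix before submitting it, the following Python sketch reads a tab-style matrix into a dictionary. It assumes one row per residue, written as the residue letter, a "|" separator and whitespace-separated values, with lines starting with ";" treated as comments. These layout details and the name parse_tab_matrix are assumptions. The example input reuses the Meis counts from the first part, not the Gcn4 matrix.

```python
def parse_tab_matrix(text):
    """Parse rows of the form 'a | 0 0 7 ...' into {residue: [values]}."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        residue, separator, values = line.partition("|")
        if not separator:
            raise ValueError(f"missing '|' separator in line: {line!r}")
        counts[residue.strip().upper()] = [float(v) for v in values.split()]
    # Every residue row must describe the same number of positions.
    if len({len(v) for v in counts.values()}) > 1:
        raise ValueError("rows have different lengths")
    return counts


example = """\
; Meis count matrix, used here only as sample input
a |\t0\t0\t7\t0\t6\t2
c |\t0\t0\t0\t6\t0\t0
g |\t0\t8\t1\t0\t1\t6
t |\t8\t0\t0\t2\t1\t0
"""

parsed = parse_tab_matrix(example)
print(sorted(parsed), len(parsed["A"]))
```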
Morgane Thomas-Chollier - Ecole Normale Supérieure
mthomas[at]biologie.ens.fr