Morgane Thomas-Chollier - Ecole Normale Supérieure        M2 ENS - September 2015

Session 4: ChIP-seq primary analysis = from reads to peaks

The goal of this exercise is that to treat the raw data (reads) to ultimately define peak regions, i.e. regions with a higher number of reads than expected by chance. For transcription factor ChIP-seq, peak regions are considered as putative binding locations. For histone modification datasets, peak regions reveal local enrichment in this particular histone modification. For the sake of time for this practical, we will not perform the quality check and read mapping steps. We will start with files of aligned reads obtained from a read aligner (e.g. Bowtie, BWA, etc...) in BAM format.

You will use the programs:
MACS to perform the peak calling from the mapped reads
UCSC browser to visualize the peaks.

Inspecting the dataset

Dataset: the data used in this practical result from a ChIP-seq experiment conducted in mouse stem cells, published as: Chen et al (2008) Integration of External Signaling Pathways with the Core Transcriptional Network in Embryonic Stem Cells Cell 133(6),1106–1117
We will focus on the Oct4 transcription factor.
  1. We have already aligned the reads on the mouse genome assembly version mm9 using Bowtie with the following options: -v 2 -m 1 -q mm9
  2. open a terminal and type bowtie to obtain the manual of the program.

    Question
    Explain the options used for mapping with Bowtie.

  3. Both the Oct4 and the Input (here called GFP) datasets have been processed.
  4. The bam files are located in /genomic/ue_m2_cisreg : GFP_forstorage.sort.bam / Oct4_forstorage.sort.bam

    Question
    Which files respectively correspond to the ChIP and the input ?
    Look at the size of the bam files. Do not copy them in you home folder...

  5. Use samtools flagstat to obtain the number of raw reads and the number of (uniquely) mapped reads.
  6. Question
    Comment on the % of mapped reads, taking into account the date of publication.


Peak-calling

You will use the program MACS to perform the peak-calling. MACS can be used with or without control (=input in this practical) dataset.
MACS can be used on ChIP-seq targeting transcription factors and chromatine modifications, but with different sets of parameters.
  1. In the command line, call the program with macs14
  2. You will first run MACS without the input. Look at the parameters in the manual so as to use -t -n -f -g correctly. The name should be "Oct4_no_input".
  3. Remember that you work with mouse sequences !

    Analyzing the results
    In the log of the programs, find the size of the reads (also named tag). Also check the lines concerning the filtering of the tags. What does the program MACS do with duplicated reads ?
    Look at the result files. Do you get files in standard formats ? How many peaks did you call ? (hint: calculate the number of lines in the .bed file)
    In the .xls file, what is the d value ?

  4. Now run MACS with the control dataset. Which option should you use ? The name should be "Oct4_vs_GFP"
  5. Analyzing the results
    How many peaks did you call this time ?
    Look at the .xls file. A new column is present. What type of values are in this column ?

  6. Select the peaks that were not called when using the control. For this, we will use the bedtools.
  7. bedtools intersect -a Oct4_alone_peaks.bed -b Oct4_vs_GFP_peaks.bed -v > Oct4_peaks_discarded.bed

    Question
    Look in the bedtools manual to understand the -v option

  8. You will now select only the peaks with a FDR value <= 2%
  9. grep -v '^#' Oct4_vs_GFP_peaks.xls | awk '{OFS="\t" ; if ($9 <=2){print}}' > Oct4_vs_GFP_peaks_fdr2.txt
  10. Extract the same peaks from the BED file (standard format) to be used later in other tools.
  11. bedtools intersect -a Oct4_vs_GFP_peaks.bed -b Oct4_vs_GFP_peaks_fdr2.txt > Oct4_vs_GFP_peaks_fdr2.bed

    Analyzing the results
    How many peaks are left ?


Visualizing the results

You will visualize the results in the UCSC browser.
(You may also try the IGV browser).
  1. On the website, select mm9 as the assembly. Then click on the button "add custom track".
  2. On the next page, upload the file Oct4_vs_GFP_peaks_fdr2.bed
  3. On the next page, click on the button "go to genome browser"
  4. Question
    How are your data shown ? Do you see the peak "shape" ?

  5. You will now make bedGraph file of the signal using the deeptools suite (http://deeptools.ie-freiburg.mpg.de/) bamCoverage program (do it only on chromosome 6 so it is faster).
  6. bamCoverage -b /genomic/ue_m2_cisreg/Oct4_forstorage.sort.bam --region chr6 --missingDataAsZero no -o Oct4_chr6.bedgraph
  7. We can add a title for the track, so the ucsc genome browser will automatically give a name to the track. Add the following line in the begining of the file:
  8. track name=Oct4_ChIPseq_chr6 type=bedGraph
  9. Do the same for the control library.
  10. bamCoverage -b /genomic/ue_m2_cisreg/GFP_forstorage.sort.bam --region chr6 --missingDataAsZero no -o GFP_chr6.bedgraph
  11. Add the following line in the begining of the file:
  12. track name=Control_ChIPseq_chr6 type=bedGraph
  13. Click on the top menu My data > Custom tracks
  14. Click on "Choose File", select your bedgraph file, click on "open". Then click on "Submit" and wait for a newpage to load.
  15. Once the new page is loaded click on "go".
  16. Analyzing the results
    Look at some peak regions, and also look at some regions that were discarded when using the control. Why do you think these peaks were discarded ?


Morgane Thomas-Chollier - Ecole Normale Supérieure        mthomas[at]biologie.ens.fr