Skip to the content.

MultiRepMacsChIPSeq - Usage

Home Overview Usage Variations Examples Applications Install

Usage guide

Below is a usage guide of the Multi-Replica Macs ChIP-Seq Wrapper. This is primarily focused on ChIPSeq, but the same principles can (more or less) be applied for ATAC-Seq and Cut & Run (and Cut & Tag) analyses.

Alignment

ChIP-Seq samples may be aligned with your favorite aligner, including NovoCraft Novoalign, Bowtie2, or BWA. Alignments should be converted to sorted, indexed Bam files using samtools or equivalent.

Duplicate alignments do not need to be removed, unless you are using Unique Molecular Index (UMIs) tags, in which case the files should be de-duplicated first; see the duplicate step below.

There are two metrics that are helpful to know prior to the execution of the pipeline. This includes estimating (single-end) or determining (paired-end) fragment sizes and checking duplicate levels.

Fragment length estimation

For single-end analysis, fragment length must be estimated by looking at the average offset between forward and reverse alignments flanking a peak. There are two ways to determine this: using Macs2 predictd function and BioToolBox bam2wig. It’s probably recommended to run both, as each has similar but different algorithms that usually derive similar values, and then evaluate and take the most reasonable one value.

For paired-end analysis, only properly-mapped pairs are analyzed (singletons are skipped for simplicity). Knowing the mean fragment size allows you to intelligently set the expected peak size. Some aligners, e.g. Novoalign, will report mean fragment length. Otherwise, a tool such as Picard CollectInsertSizeMetrics will work.

Duplication level determination

The pipeline is designed to explicitly handle duplicate alignments in a number of ways. Knowing the duplication level in advance may help to set a reasonable expectation for how to handle duplicates. One of the methods is duplication sub-sampling; see

You can obtain a quick determination by running the bam_partial_dedup application for each file. If desired, capture the standard output to file, which can then be combined into a single file with combine_std_chipstats. Be sure to specify paired-end alignments with --pe option as necessary.

bam_partial_dedup.pl -i file1.bam > file1.dup.txt

combine_std_chipstats.pl dup_stats.txt file*.dup.txt

Note that it is best to exclude known sources of duplicate reads, such as the mitochondrial chromosome, rDNA and other high-copy number genes,repetitive elements, and other known sites of artificial enrichment. In the pipeline, these should be excluded by default, but you can manually specify the exclusion lists and chromosomes in bam_partial_dedup if available.

bam_partial_dedup.pl -i file1.bam --chrskip chrM --exclude exclusion.bed > file1.dedup.txt

NOTE that if your sample has high optical duplicates, e.g. sequence from a patterned Illumina flowcell like NovaSeq, please add the --optical and --distance options to remove these. Optical duplicates are entirely artificial sequencing artifacts and should not be considered. For Novaseq, set the pixel distance to at least 2500; for NovaseqX, set the pixel distance to 200.

Running the pipeline

The pipeline is executed by passing all parameters to the multirep_macs2_pipeline script. Because of the number and complexity of the options, it’s generally recommended to place the command in a shell script and execute that. Before running, it’s highly recommended to run the command with the --dryrun option to check that all inputs are valid.

The script will spawn numerous separate jobs (sub-tasks) in steps, running them in parallel. Job control is explained below in the options section. In general, the script is designed to run on a single workstation or compute node, rather than spawning jobs through a cluster job manager. Each step command is printed to STDOUT as the pipeline is run. Individual job output is captured in separate .out.txt files and combined into a single log file upon completion. Progress is tracked using a simple output.progress.txt file, which is used to skip finished steps when restarting an incomplete run.

Simple Peak call

To call peaks on a single-condition ChIP-Seq sample with multiple replicates, call the multirep_macs2_pipeline script, treating each replicate as a separate sample. This allows the replicates to be compared directly. If each replicate has a separate reference, then a separate --control option can be repeated for each sample in the same order.

multirep_macs2_pipeline.pl \
--chip file1.bam \
--chip file2.bam \
--chip file3.bam \
--control input.bam \
--name my_experiment 

Each replicate will have a separate peak call, and the replicates merged into a single master set of peaks for subsequent analysis and comparison.

Multiple-sample peak calls

When multiple conditions are being tested and compared for differential binding, then simply add additional --chip and --name arguments to the multirep_macs2_pipeline script. In this case, replicates for each condition are averaged together prior to peak calling, with the assumption that a greater diversity and variance will be observed between the conditions than between the replicates within each condition. Independent replicate peak calls can also alternatively be made, if desired, then merged.

NOTE The order of multiple --chip, --name, and --control options is critical, and must be provided in the same order. If there are multiple controls, and some are shared between more than one ChIP but not all, then that’s ok; list each control for each ChIP and duplicate entries will be smartly handled.

multirep_macs2_pipeline.pl \
--chip file1.bam,file2.bam,file3.bam \
--chip file4.bam,file5.bam,file6.bam \
--chip file7.bam,file8.bam \
--control input1.bam \
--control input2.bam \
--control input3.bam \
--name condition1 \ 
--name condition2 \ 
--name condition3 \
--out all_chips

At the end of the pipeline, individual peaks from all conditions are merged into a single master list of all peaks identified. This is used in subsequent differential analysis to determine significant differential occupancy. A number of comparison plots are generated to assist in evaluation.

Pipeline options

These are descriptions and guidance to the variety of options to the main multirep_macs2_pipeline script. In most cases, you will want to write the command in a shell script for execution due to the complexity. See the examples folder for example scripts.

Evaluation

Unlike more straightforward types of analyses, ChIP-Seq is highly variable, owing to the particular nature and behavior of the protein being analyzed. Some have nice, narrow, tall peaks, while others are broad and diffuse. The best way to interpret ChIP-Seq analysis is to not rely on tables of numbers, but rather to view the generated bigWig data tracks in a genome browser, such as IGV. In particular, load the called peak files (.narrowPeak), statistical enrichment (.qvalue.bw bigWig files), and log2 fold enrichment (.log2FE.bw). Refining and repeating the Macs2 peak calling application is not uncommon.

Recalling peaks with different parameters

It’s not unusual, if after evaluation, to decide that the original peak calling parameters were not ideal and should be modified. Rather than rerunning the entire pipeline, peaks can now be recalled in a fraction of the time by reusing the existing q-value tracks using the recall_peaks script. Modifications can be made to the --peaksize, --peakgap, and --cutoff values, as well as the broad (gapped) peak parameters. Any changes affecting alignments, fragments, or lambda background will necessitate rerunning the entire pipeline.

NOTE: This only recalls mean-replicate peaks, and not the merged independent-replicate peaks generated with the --independent option. Recalling independent-replicate peaks currently requires rerunning the entire pipeline. For example,

recall_peaks.pl --in all_chip --out recall --dir ChIP-Seq --peaksize 150 --peakgap 100

The original pipeline --out value is given as the --in value here, and a new output defined. Point the directory to the same output directory as before, and the relevant bigWig files should be found and re-used. Peaks will be recalled, merged, and re-scored as before. New plots will be generated. New subfolders will be generated, with an incrementing digit suffix; original files will not be overwritten. The intersect_peaks script can be used to compare the old and new peak calls, if desired.

Differential peak analysis

When two or more conditions are used for ChIP, then a differential analysis can be applied across the final merged set of peaks to identify those that significantly differential. The R package DESeq2 is one recommended package, among several. The run_DESeq2 script is a convenient, simple, wrapper script for running such analysis.

run_DESeq2.R --count output_counts.txt.gz --sample output_samples.txt \
--first chip1 --second chip2 \
--norm --all --output differential_chip1_chip2 

The count table generated by this pipeline have already been sequencing-depth normalized, hence the use of the --norm option (which sets all size factors to 1) in the script. Since the peaks represent a tiny fraction of the genome, it is best to use genome-wide counts for normalization, otherwise true differences between ChIP peaks may be normalized away if DESeq2 is allowed to do it by default.

The script generate_differential will generate a differential track between two log2FE or coverage bigWig (or bedGraph) files by subtracting the second from the first. Both tracks set a minimum value before subtraction, to avoid regions that are not initially enriched. Enriched peaks from each respective ChIP may be called by defining an absolute delta difference (no statistical call is made).

generate_differential.pl --in1 chip1.log2FE.bw --in2 chip2.log2FE.bw \
--min 0.5 --delta 1 --len 200 --gap 50

Macs2 also has a differential analysis mode. This uses four input files, the fragment pileup and lambda control files for both ChIPs.

macs2 bdgdiff -t1 <file1> -t2 <file2> -c1 <file3> -c2 <file4> \
--d1 <depth> --d2 <depth> -C <cutoff> --o-prefix <basename>

This uses log likelihood for differential detection. In my experience, the R packages work better.

Manual intersection and scoring of peaks

Sometimes you want to intersect a subset of the peaks, perhaps filtering the peak files themselves or not including a sample. To regenerate a new combined list of peaks and re-score them for making new plots, you can follow the steps below. This is essentially what was run automatically by the pipeline script (compare with the output of the pipeline). Adjust accordingly for your situation.

intersect_peaks.pl --out new_list sample1.narrowPeak sample2.narrowPeak ...

get_datasets.pl --method mean --in new_list.bed --out new_list_qvalue.txt \
--format 3 *.qvalue.bw

get_datasets.pl --method mean --in new_list.bed --out new_list_log2FE.txt \
--format 3 *.log2FE.bw

get_datasets.pl --method sum --in new_list.bed --out new_list_counts.txt \
--format 0 *.count.bw

cp *_samples.txt new_list_samples.txt

plot_peak_figures.R -i new_list

In the above example, be sure to edit the _samples.txt file to reflect the new samples, if necessary.