MultiRepMacsChIPSeq - Variations
Variations
The pipeline may be used with other chromatin sequencing assays beyond ChIP-Seq. Below are some notes for doing so for ATAC-Seq, Cut-and-Run, Cut-and-Tag, and MNase.
Variation with ATAC-Seq
ATAC-Seq uses Tn5 transposase to probe chromatin structure in situ by cutting accessible or “open” DNA, often between nucleosomes and around active loci, such as promoters and enhancer elements. The release of a chromatin fragment occurs when two independent transposition events occur in proximity. ATAC-Seq libraries are generally sequenced as paired-end libraries so that fragment size can be determined.
There is no “Input” sequence, although naked DNA prepared with Tn5 could in theory be used. Enrichment is determined by using a genomic (chromosomal) mean coverage derived from the experimental ATAC-Seq fragment coverage.
Mitochondrial DNA contamination is often high with ATAC-Seq data, so mtDNA should definitely be excluded with the --chrskip option.
Genomic coverage can be sparse, so to get an appropriate enrichment calculation, the --genome parameter should be manually set to the mappable size of your genome build.
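To see why the reference genome size matters, here is a minimal sketch of enrichment against a genome-wide mean coverage. It is an illustration only and not the pipeline's implementation; the function names, the fragment counts, and both genome sizes are hypothetical.

```python
def mean_coverage(total_fragment_bases, genome_size):
    """Average per-base depth if all fragment bases were spread evenly over genome_size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be a positive number of base pairs")
    return total_fragment_bases / genome_size


def fold_enrichment(local_depth, total_fragment_bases, genome_size):
    """Local depth relative to the genome-wide mean depth."""
    return local_depth / mean_coverage(total_fragment_bases, genome_size)


# Hypothetical library: 1 million fragments of 100 bp each.
total_bases = 1_000_000 * 100

# Reference size taken as a hypothetical 1 Gb mappable genome.
print(fold_enrichment(2.0, total_bases, 1_000_000_000))

# Reference size taken as only the 200 Mb that happened to be covered.
print(fold_enrichment(2.0, total_bases, 200_000_000))
```

With the smaller denominator the background looks five times deeper, so the same locus appears five times less enriched. Supplying the mappable genome size explicitly avoids that distortion when coverage is sparse.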
Duplication rate is frequently high with ATAC libraries, with a mix of high biological duplication (cut sites are restricted by chromatin accessibility, limiting potential diversity) and PCR amplification from library preparation. If you keep duplicates (often recommended due to the biological source), sub-sampling to a consistent rate is strongly recommended. The rate can be kept at a higher fraction than with ordinary ChIP-Seq (--dedup 0.1 or 0.2). Optical duplicates absolutely need to be discarded in this scenario (--optdist) as appropriate for the sequencing platform.
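The arithmetic behind sub-sampling duplicates to a consistent rate can be sketched as follows. This is a minimal illustration that assumes the rate means retained duplicates divided by all retained fragments; the pipeline's own de-duplication logic may define or apply the rate differently, and the function name and counts here are hypothetical.

```python
def duplicates_to_keep(unique_count, duplicate_count, target_rate):
    """Number of duplicate fragments to retain so that duplicates make up
    target_rate of the retained fragments (assumed definition of the rate)."""
    if not 0 <= target_rate < 1:
        raise ValueError("target_rate must be in the range [0, 1)")
    # Solve kept / (unique + kept) = target_rate for kept.
    wanted = round(target_rate * unique_count / (1 - target_rate))
    # A sample already below the target keeps all of its duplicates.
    return min(duplicate_count, wanted)


# Hypothetical sample with 40% duplication, sub-sampled to a 20% rate.
print(duplicates_to_keep(600_000, 400_000, 0.2))

# Hypothetical sample with 10% duplication is left unchanged at a 20% target.
print(duplicates_to_keep(900_000, 100_000, 0.2))
```

Applying the same target to every sample brings high-duplication libraries down to a common level while leaving low-duplication libraries untouched.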
There are two general strategies to look at ATAC-Seq data. I usually do both.
- Fragment analysis

  An analysis of the fragments generated by ATAC-Seq as a measure of general DNA accessibility, i.e. the higher fragment coverage indicates a higher degree of open chromatin. This basically treats the ATAC-Seq fragments like a ChIP-Seq, but without a control reference.

  If desired, analysis of specific size ranges can be performed. For example, one can restrict to sub-nucleosomal (30-120 bp) or nucleosomal (130-175 bp) fragments. Use the --min and --max options to specify acceptable paired-end size ranges. For example,

      --pe \
      --min 30 \
      --max 120 \
- Cut site analysis

  An analysis of the actual cut sites to get a higher-resolution view of explicitly open DNA (or DNase hypersensitive, HS) sites. This analysis is often chosen to examine potential transcription factor binding sites, which frequently occur at HS sites. Technically, this is a more appropriate analysis, since the cut sites are independent events without regard to the intervening DNA.

  To run this analysis, simply include the --cutsite option (formerly --atac), which simultaneously sets a number of default parameters. Specifically, a short fragment coverage is generated centered over each cut site (the alignment ends), and peak calls are made. Cut-site point data (.count.bw files) are generated with 5 bp shifts in case fine-mapping of transcription factors is pursued.

  In --cutsite mode, the following parameters are automatically set:

      --dedupair \
      --paired \
      --minsize 30 \
      --maxsize 2000 \
      --fragsize 50 \
      --shiftsize -25 \
      --peaksize 90 \
      --peakgap 30 \
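The two strategies above can be sketched in a few lines. This is a conceptual illustration rather than the pipeline's code: it assumes 0-based, half-open fragment coordinates, treats both ends of a paired-end fragment as cut sites, and centers a 50 bp window on each end (matching a 50 bp fragment size shifted by -25 bp). The function names and example values are hypothetical.

```python
def in_size_range(fragment_length, min_size, max_size):
    """True if a paired-end fragment length falls within the inclusive range."""
    return min_size <= fragment_length <= max_size


def cut_site_windows(start, end, width=50):
    """Return one window per fragment end, each centered on that cut site.

    start and end are 0-based, half-open fragment coordinates (an assumption
    of this sketch). Windows are clipped at the chromosome start.
    """
    half = width // 2
    windows = []
    for site in (start, end):
        left = max(0, site - half)
        right = site - half + width
        windows.append((left, right))
    return windows


# Fragment analysis: keep only sub-nucleosomal or nucleosomal fragments.
lengths = [45, 110, 125, 150, 180, 400]
print([n for n in lengths if in_size_range(n, 30, 120)])
print([n for n in lengths if in_size_range(n, 130, 175)])

# Cut site analysis: a 180 bp fragment contributes two independent windows.
print(cut_site_windows(1000, 1180))
```

Note that the fragment analysis keeps the whole fragment as the unit of coverage, whereas the cut site analysis discards the intervening DNA and counts each end on its own.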
Variation with Cut-and-Run
Cut-and-Run experiments use an antibody-targeted MNase fusion protein to release specific chromatin fragments in situ that contain the targeted epitope, leaving the majority of the genome mostly intact. Since only the released, and presumably targeted, fragments are recovered and sequenced, the signal should be distinct relative to a low background. Hence, it is analogous to ChIP-Seq, but without the complications of pull-down and enrichment over background. Consequently, sequencing depths are usually quite low. This technique benefits greatly from biological replicates, which can help reduce spurious peak calls.
Duplication rates tend to be higher than standard ChIP-Seq due to extremely low DNA quantities in the library preparation and PCR amplification. While biological duplication may also be likely due to restricted sites of chromatin digestion, the high PCR duplication makes retaining duplicates dubious and could yield higher false positives. Therefore, duplicate sub-sampling can be used but should be limited to low rates.
Sequencing depths may not be very high, so manually setting the --genome size may be necessary.
Since no Input or reference chromatin is sequenced, there will be no --input parameters provided. In this case, a genomic (chromosomal) mean coverage derived from the target sequencing coverage is used as the reference to calculate enrichment. This is usually sufficient for calling peaks, although tweaks to the peak-calling parameters, --cutoff, --peaksize, and --peakgap, are usually required.
Typically a non-specific or generic IgG antibody is used as a control to account for hot spots and false-positive peaks. If the genomic coverage from this sequenced sample is sufficiently high, say > 75% of genome size, it could be used as “Input” to the “ChIP” samples. However, it is usually best to simply treat this as a separate sample, and use the peaks called from IgG as exclusion peaks for the experimental samples. This will necessitate running the pipeline twice, once on IgG samples, and again on experimental samples, where you will set the --exclude parameter to the combined output bed file from the IgG samples.
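The idea of using IgG peaks as exclusion intervals can be sketched as a simple overlap filter. This is only a conceptual illustration: it does not reproduce how the pipeline applies an exclusion file internally, the intervals are assumed to be BED-style (0-based, half-open), and the function names and coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def drop_overlapping(intervals, exclusion):
    """Keep only the intervals that overlap none of the exclusion intervals."""
    return [i for i in intervals if not any(overlaps(i, x) for x in exclusion)]


# Hypothetical intervals from an experimental sample and from IgG.
experimental = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
igg = [("chr1", 150, 250)]

print(drop_overlapping(experimental, igg))
```

Only the chr1 interval that overlaps the IgG interval is dropped; the interval with the same coordinates on chr2 is kept because the chromosome differs.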
External spike-in genome normalization, usually with bacterial or yeast DNA, is generally not helpful with peak calling, as it can distort the peak-calling statistics. If spike-in sequences are included in the bam file, be sure to exclude them with the --chrskip option. The normalization factors could be used in subsequent visualization or differential analysis by manually re-running the bam2wig.pl application with explicit scaling factors.
Variation with Cut-and-Tag
This is very similar to Cut-and-Run, except that a Tn5 transposase fusion is used to digest the chromatin rather than MNase. The advantage is that the library preparation is considerably easier, since adapters are integrated onto the cut-ends without having to do ligation, as with Cut-and-Run.
The disadvantage, however, is that non-targeted and spurious Tn5 digestion does NOT yield a mostly uniform, non-specific, background as with Cut-and-Run, but rather may mirror a more typical ATAC-Seq experiment. In other words, open chromatin regions are digested, which means most promoters will be called as peaks in non-targeted or IgG samples. In practical terms, this means that these peaks should probably not be used as exclusion regions for your experimental samples, as with Cut-and-Run. Obviously, this needs to be evaluated on a case-by-case basis.
Otherwise, most of the parameters that apply to Cut-and-Run also apply to Cut-and-Tag.
Variation with MNase-Seq
Processing MNase-Seq files can be done, but with caveats. Generally, “peaks” are not necessarily called in these types of analysis for a few reasons. First, the number of differences is so large that it considerably reduces the statistical significance scores of any peaks to be called. Second, shifts in nucleosomes can create differences that are generally too small to be reliably called. Third, differences between conditions can go in either direction, up or down, and Macs2 is not suited for these types of differential significance calls.
In personal experience, calling differences as a delta coverage score is generally more reliable. See the generate_differential.pl application as a convenient tool to do this.
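As a rough illustration of a delta coverage score, the sketch below subtracts one binned coverage track from another and flags bins whose absolute difference exceeds a cutoff. It is not the logic of generate_differential.pl; the function names, the bin values, and the cutoff are hypothetical, and the two tracks are assumed to be already normalized to comparable depth.

```python
def delta_coverage(condition_a, condition_b):
    """Per-bin difference (a minus b) between two equally binned coverage tracks."""
    if len(condition_a) != len(condition_b):
        raise ValueError("coverage tracks must have the same number of bins")
    return [a - b for a, b in zip(condition_a, condition_b)]


def changed_bins(delta, cutoff):
    """Indices of bins whose difference exceeds the cutoff in either direction."""
    return [i for i, d in enumerate(delta) if abs(d) > cutoff]


# Hypothetical depth-normalized coverage over three bins.
wild_type = [2.0, 5.0, 1.5]
mutant = [2.0, 3.0, 4.0]

delta = delta_coverage(wild_type, mutant)
print(delta)
print(changed_bins(delta, 1.0))
```

Because the score is signed, gains and losses are both reported, which addresses the third caveat above about differences going in either direction.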
Duplication is also a concern, and similar to ATAC-Seq, MNase-derived nucleosome fragments exhibit extremely high biological duplication. The same cautions should be applied with MNase-Seq as with ATAC-Seq (see above).
Reference genome scaling
Using a reference genome for controlling enrichment between conditions, such as including Drosophila chromatin in your ChIP assay of human chromatin as described, for example, in Orlando et al., is an advanced technique of analysis. Unfortunately, this normalization scale cannot be used in peak calling, since it artificially skews the coverage depth and breaks the assumptions of equality (null hypothesis) between ChIP and control required for making a statistically confident peak call. It is best to call peaks without normalization, preferably in a normal or wild type situation, and then assay those peaks with external-genome normalized coverage tracks in subsequent analysis.
Earlier versions of this pipeline included options for scaling. These options remain as advanced options but are not detailed.
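For the subsequent, post-peak-calling analysis, one way to derive per-sample scaling factors from reference-genome read counts is sketched below. The convention used here (scale each sample inversely to its reference read count, relative to the sample with the fewest reference reads) is an assumption of this sketch, not a description of the pipeline's options or of any specific published formula. The sample names and counts are hypothetical.

```python
def reference_scale_factors(reference_counts):
    """Map each sample to a factor inversely proportional to its reference read count.

    The sample with the fewest reference-genome reads gets a factor of 1.0;
    samples with more reference reads are scaled down proportionally.
    """
    if not reference_counts:
        raise ValueError("at least one sample is required")
    if any(n <= 0 for n in reference_counts.values()):
        raise ValueError("reference read counts must be positive")
    smallest = min(reference_counts.values())
    return {name: smallest / n for name, n in reference_counts.items()}


# Hypothetical counts of reads aligning to the reference (spike-in) genome.
counts = {"wild_type": 2_000_000, "mutant": 4_000_000}
print(reference_scale_factors(counts))
```

Factors like these would be applied only when generating coverage tracks for comparison between conditions, never during peak calling, for the reasons given above.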