MultiRepMacsChIPSeq - Variations

Variations

The pipeline may be used with chromatin sequencing assays other than ChIP-Seq. Below are some notes for doing so with ATAC-Seq, Cut-and-Run, Cut-and-Tag, and MNase-Seq.

Variation with ATAC-Seq

ATAC-Seq uses Tn5 transposase to probe chromatin structure in situ by cutting accessible or “open” DNA, often between nucleosomes and around active loci, such as promoters and enhancer elements. The release of a chromatin fragment occurs when two independent transposition events occur in proximity. ATAC-Seq libraries are generally sequenced as paired-end libraries so that fragment size can be determined.

There is no “Input” sequence, although naked DNA prepared with Tn5 could in theory be used. Enrichment is determined by using a genomic (chromosomal) mean coverage derived from the experimental ATAC-Seq fragment coverage.

Mitochondrial DNA contamination is often high with ATAC-Seq data, so mtDNA should definitely be excluded with the --chrskip option.

Genomic coverage can be sparse, so to get an appropriate enrichment calculation, the --genome parameter should be manually set to the mappable size of your genome build.
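To see why the genome size matters here, the following is a minimal Python sketch of an enrichment calculation against a genomic mean. It assumes the mean is simply the total fragment bases divided by the genome size, and that a size estimated from covered bases alone would be too small for a sparse library. Both are illustrative assumptions rather than a description of the pipeline's internals, and the function names are hypothetical.

```python
def mean_coverage(total_fragment_bp, genome_size):
    """Average depth if all fragment bases were spread evenly over genome_size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_fragment_bp / genome_size


def fold_enrichment(local_depth, total_fragment_bp, genome_size):
    """Local depth relative to the genomic mean coverage."""
    return local_depth / mean_coverage(total_fragment_bp, genome_size)


# Hypothetical library: 2 million fragments of 100 bp each.
total_bp = 2_000_000 * 100

# Same locus (depth 20), two different denominators.
with_mappable_size = fold_enrichment(20, total_bp, 100_000_000)  # full mappable size
with_covered_only = fold_enrichment(20, total_bp, 40_000_000)    # covered bases only

print(with_mappable_size, with_covered_only)
```

With the smaller denominator the genomic mean is inflated, so the same locus appears less enriched. That is the effect an explicit --genome value is meant to avoid.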

Duplication rate is frequently high with ATAC libraries, with a mix of high biological duplication (cut sites are restricted by chromatin accessibility, limiting potential diversity) and PCR amplification from library preparation. If you keep duplicates (often recommended due to the biological source), sub-sampling to a consistent rate is strongly recommended. The rate can be kept at a higher fraction than with ordinary ChIP-Seq (--dedup 0.1 or 0.2). Optical duplicates absolutely need to be discarded in this scenario (--optdist) as appropriate for the sequencing platform.
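As a rough illustration of what sub-sampling to a consistent rate means arithmetically, here is a small Python sketch. It assumes the rate is defined as the fraction of retained alignments that are duplicates; the pipeline's de-duplication tool may define or reach the target differently, and the function name is hypothetical.

```python
from fractions import Fraction


def duplicates_to_keep(unique, duplicates, target_rate):
    """Number of duplicate alignments to retain so that
    kept / (unique + kept) is at most target_rate."""
    rate = Fraction(str(target_rate))  # exact arithmetic, avoids float rounding
    if not 0 <= rate < 1:
        raise ValueError("target_rate must be in [0, 1)")
    # Solve kept / (unique + kept) = rate for kept, rounding down.
    wanted = int(rate * unique / (1 - rate))
    # Never keep more duplicates than the library actually has.
    return min(wanted, duplicates)


# Hypothetical library: 8 M unique alignments, 4 M duplicates (33% duplication).
kept = duplicates_to_keep(8_000_000, 4_000_000, 0.2)
print(kept, kept / (8_000_000 + kept))
```

A library that is already below the target keeps all of its duplicates, so replicates with very different starting rates end up at or below the same rate.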

There are two general strategies to look at ATAC-Seq data. I usually do both. There is often a high correlation in the peak calls between the two strategies, but in general fragment analysis yields larger peaks, e.g. 1-3 Kb, while cut site analysis yields smaller peaks, e.g. 100-500 bp; actual results will vary.

Fragment analysis

An analysis of the fragments generated by ATAC-Seq as a measure of general, or regional, DNA accessibility, i.e. higher fragment coverage indicates a higher degree of open chromatin. This essentially treats the ATAC-Seq fragments like ChIP-Seq fragments, but without a control reference.

If desired, analysis of specific size ranges can be performed. For example, one can restrict to sub-nucleosomal (30-120 bp) or nucleosomal (130-175 bp) fragments. Use the --min and --max options to specify acceptable paired-end size ranges. For example,
```
  --pe \
  --min 30 \
  --max 120 \
```

Cut site analysis

An analysis of the actual cut sites to get a higher resolution analysis of explicitly open DNA, or DNase hypersensitive (HS), sites. This analysis is often chosen to examine potential transcription factor binding sites, which frequently occur at HS sites. Technically, this is a more appropriate analysis, since the cut sites are completely independent Tn5 insertion events without regard to the intervening DNA, i.e. we ignore the fragment itself.

To run this analysis, simply include the --cutsite option (formerly --atac), which simultaneously sets a number of default parameters. Specifically, a short fragment coverage is generated centered over each cut site (the alignment ends), and peak calls are made. Cut-site point data (.count.bw files) are generated with 5 bp shifts in case fine-mapping of transcription factors is pursued.

In --cutsite mode, the following parameters are automatically set:
```
  --dedupair \
  --nopaired \
  --minsize 30 \
  --maxsize 2000 \
  --fragsize 50 \
  --shiftsize -25 \
  --peaksize 90 \
  --peakgap 30 \
```
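
To make the fragment size and shift values concrete, here is a minimal Python sketch of how a 50 bp window with a -25 bp shift ends up centered over each cut site. It reflects my reading of the parameters (shift along the strand, then extend), not the actual coverage code, and the function name and coordinate convention are assumptions for illustration.

```python
def cut_site_windows(frag_start, frag_end, fragsize=50, shiftsize=-25):
    """Return the two coverage windows for one paired-end fragment.

    Coordinates are 0-based, half-open: the fragment spans
    [frag_start, frag_end), and each end is treated as one cut site.
    """
    # Forward-strand end: a negative shift moves upstream, then extend downstream.
    left_begin = frag_start + shiftsize
    left = (max(left_begin, 0), left_begin + fragsize)

    # Reverse-strand end: the strand runs toward lower coordinates,
    # so the same shift and extension are applied with flipped signs.
    right_end = frag_end - shiftsize
    right = (max(right_end - fragsize, 0), right_end)

    return [left, right]


# A hypothetical 200 bp fragment contributes two independent 50 bp windows.
windows = cut_site_windows(1000, 1200)
print(windows)
```

Note that the DNA between the two cut sites contributes no coverage, which is the point of ignoring the fragment itself.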

Variation with Cut-and-Run

Cut-and-Run experiments use an antibody-targeted MNase fusion protein to release specific chromatin fragments in situ that contain the targeted epitope, leaving the majority of the genome mostly intact. Since only the released, and presumably targeted, fragments are recovered and sequenced, the signal should be distinct relative to a low background. Hence, it is analogous to ChIP-Seq, but without the complications of pull-down and enrichment over background. Consequently, sequencing depths are usually quite low. This technique benefits greatly from biological replicates, which can help reduce spurious peak calls.

Duplication rates tend to be higher than standard ChIP-Seq due to extremely low DNA quantities in the library preparation and PCR amplification. While biological duplication may also be likely due to restricted sites of chromatin digestion, the high PCR duplication makes retaining duplicates dubious and could yield higher false positives. Therefore, duplicate sub-sampling can be used but should be limited to low rates.

Sequencing depths may not be very high, so manually setting the --genome size may be necessary.

Since no Input or reference chromatin is sequenced, there will be no --input parameters provided. In this case, a genomic (chromosomal) mean coverage derived from the target sequencing coverage is used as the reference to calculate enrichment. This is usually sufficient for calling peaks, although tweaks to the peak-calling parameters, --cutoff, --peaksize, and --peakgap, are usually required.

Typically a non-specific or generic IgG antibody is used as a control to account for hot spots and false-positive peaks. If the genomic coverage from this sequenced sample is sufficiently high, say > 75% of genome size, it could be used as “Input” to the “ChIP” samples. However, it is usually best to simply treat this as a separate sample, and use the peaks called from IgG as exclusion peaks for the experimental samples. This will necessitate running the pipeline twice, once on IgG samples, and again on experimental samples, where you will set the --exclude parameter to the combined output bed file from the IgG samples.
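
Conceptually, using IgG peaks as exclusion regions amounts to an interval overlap filter. The Python sketch below illustrates only that idea; it says nothing about how the --exclude option is applied internally, and the function names and the toy intervals are hypothetical.

```python
def overlaps(a, b):
    """True if two BED-style (chrom, start, end) intervals share any base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def drop_excluded(peaks, exclusion):
    """Keep only the peaks that do not overlap any exclusion interval."""
    return [p for p in peaks if not any(overlaps(p, e) for e in exclusion)]


# Toy data: one experimental peak coincides with an IgG hot spot.
experimental = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
igg_peaks = [("chr1", 150, 250)]

kept_peaks = drop_excluded(experimental, igg_peaks)
print(kept_peaks)
```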

External spike-in genomes, usually bacterial or yeast, are generally not helpful for peak calling, since normalizing to them can distort the peak-calling statistics. If spike-in alignments are included in the bam file, be sure to exclude them with the --chrskip option. The normalization factors could still be used in subsequent visualization or differential analysis by manually re-running the bam2wig.pl application with explicit scaling factors.

Variation with Cut-and-Tag

This is very similar to Cut-and-Run, except that a Tn5 transposase fusion is used to digest the chromatin rather than MNase. The advantage is that the library preparation is considerably easier, since adapters are integrated onto the cut-ends without having to do ligation, as with Cut-and-Run.

The disadvantage, however, is that non-targeted and spurious Tn5 digestion does NOT yield a mostly uniform, non-specific, background as with Cut-and-Run, but rather may mirror a more typical ATAC-Seq experiment. In other words, open chromatin regions are digested, which means most promoters will be called as peaks in non-targeted or IgG samples. In practical terms, this means that these peaks should probably not be used as exclusion regions for your experimental samples, as with Cut-and-Run. Obviously, this needs to be evaluated on a case-by-case basis.

Otherwise, most of the parameters that apply to Cut-and-Run also apply to Cut-and-Tag.

Variation with MNase-Seq

Processing MNase-Seq files can be done, but with caveats. Generally, “peaks” are not called in these types of analysis, for a few reasons. First, the number of differences is so large that it considerably reduces the statistical significance scores needed for peaks to be called. Second, shifts in nucleosome position can create differences that are generally too small to be reliably called. Third, differences between conditions can go in either direction, up or down, and Macs2 is not suited for these types of differential significance calls.

In personal experience, calling differences as a delta coverage score is generally more reliable. See the generate_differential.pl application as a convenient tool to do this.
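
As a rough idea of what a delta coverage score looks like, here is a minimal Python sketch. It assumes both coverage tracks are first scaled to the same mean depth and then subtracted bin by bin; this is an illustration of the concept only, not the behavior or options of generate_differential.pl, and the function names are hypothetical.

```python
def scale_to_unit_mean(coverage):
    """Scale a list of per-bin coverage values so that their mean is 1."""
    total = sum(coverage)
    if total == 0:
        raise ValueError("coverage is empty or all zero")
    n = len(coverage)
    return [value * n / total for value in coverage]


def delta_coverage(condition_a, condition_b):
    """Per-bin difference (a minus b) after scaling both tracks."""
    a = scale_to_unit_mean(condition_a)
    b = scale_to_unit_mean(condition_b)
    return [x - y for x, y in zip(a, b)]


# Toy tracks over four bins; differences can be positive or negative.
delta = delta_coverage([2, 2, 8, 4], [1, 1, 1, 1])
print(delta)
```

Unlike a peak call, the result keeps its sign, so gains and losses between conditions are both visible.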

Duplication is also a concern, and similar to ATAC-Seq, MNase-derived nucleosome fragments exhibit extremely high biological duplication. The same cautions should be applied with MNase-Seq as with ATAC-Seq (see above).

Reference genome scaling

Using a reference genome to control enrichment between conditions, for example by including Drosophila chromatin in a ChIP assay of human chromatin as described in Orlando et al, is an advanced analysis technique. Unfortunately, this normalization scale cannot be used in peak calling, since it artificially skews the coverage depth and breaks the assumption of equality (the null hypothesis) between ChIP and control that is required for making a statistically confident peak call. It is best to call peaks without normalization, preferably in a normal or wild type situation, and then assay those peaks with external-genome normalized coverage tracks in subsequent analysis.
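
For the subsequent analysis step, scaling factors have to be derived from the reference genome alignments. The Python sketch below shows one simple convention, scaling every sample so that its reference-genome alignment count matches the smallest one. This is an assumption for illustration, not necessarily the method of the cited paper or of this pipeline, and the sample names and counts are made up.

```python
def reference_scale_factors(reference_counts):
    """Map sample name -> scaling factor from reference-genome alignment counts.

    Samples with more reference alignments are scaled down, so that the
    reference signal is equal across samples after scaling.
    """
    if not reference_counts or min(reference_counts.values()) <= 0:
        raise ValueError("every sample needs a positive reference count")
    smallest = min(reference_counts.values())
    return {name: smallest / count for name, count in reference_counts.items()}


# Hypothetical counts of alignments to the reference (spike-in) genome.
factors = reference_scale_factors({"wildtype": 1_000_000, "mutant": 2_000_000})
print(factors)
```

Factors like these would be applied only when generating coverage tracks for comparison, after peaks have been called on unscaled data.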

Earlier versions of this pipeline included options for scaling. These options remain as advanced options but are not detailed.