MultiRepMacsChIPSeq - bam_partial_dedup

bam_partial_dedup.pl

A script to remove excessive duplicate alignments down to an acceptable fraction. This is in contrast to traditional duplicate removers that remove all duplicates, retaining only one alignment per position.

Duplicate reads may be artificial (derived from PCR duplication due to very low input) or natural (multiple DNA fragments enriched at a few narrow foci, such as ChIP peaks). Removing all duplicates can significantly reduce high-enrichment peaks, but not removing duplicates can lead to false positives. An optimal balance is therefore desirable. This is most important when comparing between replicates or samples.

This script can randomly subsample duplicate reads, removing enough of them to reach a target duplication rate. This results in a more uniform duplicate reduction across the genome, which can be more typical of true PCR duplication. Set the target duplication rate with the --frac option below.
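The arithmetic behind reaching a target rate can be sketched as follows. This is a minimal illustration, not the script's actual code: it assumes the duplication rate is defined as (total alignments minus unique positions) divided by total alignments, and the function names `duplication_rate` and `subsample_to_fraction` are hypothetical.

```python
import random

def duplication_rate(counts):
    """Fraction of alignments that are duplicates (assumed definition)."""
    total = sum(counts.values())
    return (total - len(counts)) / total if total else 0.0

def subsample_to_fraction(counts, frac, seed=None):
    """Randomly drop duplicates so the expected rate is about frac (0 <= frac < 1)."""
    rng = random.Random(seed)
    unique = len(counts)
    dups = sum(counts.values()) - unique
    # Number of duplicates to keep so that kept / (unique + kept) == frac.
    target = frac * unique / (1.0 - frac)
    keep_prob = min(1.0, target / dups) if dups else 1.0
    kept = {}
    for pos, n in counts.items():
        # One alignment per position always stays; each extra one is kept by chance.
        kept[pos] = 1 + sum(rng.random() < keep_prob for _ in range(n - 1))
    return kept

# Made-up example: 2000 positions with 5 alignments each, an 80% duplication rate.
counts = {pos: 5 for pos in range(2000)}
reduced = subsample_to_fraction(counts, 0.2, seed=1)
print(duplication_rate(counts), round(duplication_rate(reduced), 3))
```

Because every duplicate has the same chance of being kept, the reduction is spread evenly across positions rather than concentrated at the tallest peaks.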

This script can also simply remove excessive duplicate reads at positions that exceed a specified maximum count (see --max below). This can be used either alone or in combination with the random subsample. Using it alone is generally not recommended, as it reduces signal at extreme peaks without addressing low-level duplication elsewhere across the genome.
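A small sketch shows why a per-position cap on its own only touches extreme positions. The function name `cap_per_position` and the example counts are made up for illustration and do not come from the script.

```python
def cap_per_position(counts, max_count):
    """Keep at most max_count alignments at each position."""
    return {pos: min(n, max_count) for pos, n in counts.items()}

# Made-up counts: one extreme peak and two ordinary positions.
counts = {"chr1:100": 50, "chr1:200": 3, "chr1:300": 1}
capped = cap_per_position(counts, 5)
print(capped)  # {'chr1:100': 5, 'chr1:200': 3, 'chr1:300': 1}
```

Only the position above the cap changes; the low-level duplication at the other positions is left as it was.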

For ChIPSeq applications, check the duplication level of the input. For mammalian genomes, typically 5-20% duplication is observed in sonicated input. For very strong enrichment of certain targets, it’s not unusual to see higher duplication rates in ChIP samples than in Input samples. Generally, set the target fraction of all samples to the lowest observed duplication rate.
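As a small illustration of that last rule, the target can be taken as the minimum of the observed rates. The sample names and rates below are invented for the example, and `choose_target_fraction` is a hypothetical helper.

```python
def choose_target_fraction(observed_rates):
    """Return the lowest observed duplication rate across samples."""
    return min(observed_rates.values())

# Invented duplication rates, e.g. as measured by running the script without --out.
observed = {"input": 0.08, "chip_rep1": 0.21, "chip_rep2": 0.17}
target = choose_target_fraction(observed)
print(target)  # 0.08
```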

Single-end alignments are identified as duplicates by comparing start position, strand, and calculated alignment end position. Because of this, the numbers may be slightly different from those calculated by traditional duplicate removers.

Paired-end alignments are treated as fragments. Only properly paired alignments are considered; singletons are skipped. Fragments are checked for start position and fragment length (paired insertion size) for duplicates. Random subsampling should not result in broken pairs.
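The two comparison rules above can be pictured as grouping keys. This is only a sketch under assumptions: the record types `Alignment` and `Fragment` and the key functions are hypothetical, and the chromosome is included in each key because positions are only comparable within one reference sequence.

```python
from collections import Counter, namedtuple

Alignment = namedtuple("Alignment", "chrom start end strand")  # single-end read
Fragment = namedtuple("Fragment", "chrom start length")        # properly paired fragment

def single_end_key(aln):
    """Single-end duplicates share start, strand, and calculated end."""
    return (aln.chrom, aln.start, aln.strand, aln.end)

def paired_end_key(frag):
    """Paired-end duplicates share fragment start and fragment length."""
    return (frag.chrom, frag.start, frag.length)

def count_by_key(records, key_func):
    """Count how many records fall into each duplicate group."""
    return Counter(key_func(r) for r in records)

reads = [
    Alignment("chr1", 100, 150, "+"),
    Alignment("chr1", 100, 150, "+"),  # duplicate of the first read
    Alignment("chr1", 100, 148, "+"),  # same start, different end: separate group
    Alignment("chr1", 100, 150, "-"),  # same coordinates, other strand: separate group
]
groups = count_by_key(reads, single_end_key)
print(len(groups), max(groups.values()))  # 3 2
```

Including the end position is what can make the counts differ from tools that compare only start and strand.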

Optical duplicates, arising from neighboring clusters on a sequencing flow cell with identical sequence, can also be checked. When randomly subsampling duplicates, it is critical that optical duplicates be ignored. Checking is highly recommended for patterned flow cells from Illumina NovaSeq or NextSeq. Set a distance of 100 pixels for unpatterned (Illumina HiSeq) or at least 10000 for patterned (NovaSeq). By default, optical duplicate alignments are not written to output. To ONLY filter for optical duplicates, set --max to a very high number. Note that tile-edge duplicates are not counted as such.
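The idea of an optical duplicate check can be sketched from the read name alone. The sketch below assumes 7-element colon-delimited names with tile, X, and Y in fields 5, 6, and 7, and it assumes a straight-line (Euclidean) distance between clusters; the script's actual distance measure is not stated here, and the helper names are hypothetical.

```python
import math

def parse_coords(read_name, fields=(5, 6, 7)):
    """Extract (tile, x, y) from a colon-delimited read name using 1-based fields."""
    parts = read_name.split(":")
    tile, x, y = (int(parts[i - 1]) for i in fields)
    return tile, x, y

def is_optical_pair(name_a, name_b, distance=100):
    """True if both reads are on the same tile and within the distance threshold."""
    tile_a, xa, ya = parse_coords(name_a)
    tile_b, xb, yb = parse_coords(name_b)
    if tile_a != tile_b:
        # Clusters on different tiles are never paired, so tile-edge cases are missed.
        return False
    # Assumption: Euclidean distance between the two cluster coordinates.
    return math.hypot(xa - xb, ya - yb) <= distance

a = "M1:1:FC:1:1101:1000:2000"   # made-up read names
b = "M1:1:FC:1:1101:1030:2040"   # 50 pixels away on the same tile
c = "M1:1:FC:1:1101:1000:2600"   # 600 pixels away on the same tile
print(is_optical_pair(a, b), is_optical_pair(a, c), is_optical_pair(a, c, distance=10000))
# True False True
```

The last call shows why the threshold matters: a pair that is not optical at 100 pixels is flagged once the larger patterned-flow-cell distance is used.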

Existing alignment duplicate marks (bit flag 0x400) are ignored.

Since repetitive and high copy genomic regions are a big source of duplicate alignments, these regions can and should be entirely skipped by providing a file with recognizable coordinates. Any alignments overlapping these intervals are skipped in both counting and writing.
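Skipping alignments that overlap excluded regions amounts to an interval overlap test. The following is a minimal sketch using half-open coordinates and a simple linear scan; the data layout and function names are assumptions for illustration, not the script's internals.

```python
def overlaps(start, end, intervals):
    """True if the half-open range [start, end) overlaps any (istart, iend) interval."""
    return any(start < iend and istart < end for istart, iend in intervals)

def drop_excluded(alignments, excluded):
    """Keep only (chrom, start, end) alignments that touch no excluded interval."""
    return [a for a in alignments if not overlaps(a[1], a[2], excluded.get(a[0], []))]

# Made-up exclusion list and alignments.
excluded = {"chr1": [(1000, 2000)]}
alignments = [
    ("chr1", 900, 950),    # entirely before the interval: kept
    ("chr1", 1990, 2040),  # partially overlaps: skipped
    ("chr2", 1500, 1550),  # chromosome has no excluded intervals: kept
]
print(drop_excluded(alignments, excluded))
# [('chr1', 900, 950), ('chr2', 1500, 1550)]
```

An alignment that only partially overlaps an interval is still dropped, matching the description that any overlap is enough to skip it.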

USAGE:

bam_partial_dedup.pl --in input.bam

bam_partial_dedup.pl --frac 0.1 --in input.bam --out output.bam

VERSION: 5.8

OPTIONS:

Required:
  --in <file>         The input bam file, should be sorted and indexed

Alignment Filtering:
  --qual <int>        Skip alignments below indicated mapping quality (0)
  --pe                Bam file contains paired-end alignments; only 
						properly paired fragments are checked for 
						duplication. Singletons are silently dropped.
  --size <int>,<int>  Set the minimum and maximum allowed paired insert size.
						Pairs outside this range are silently discarded.
  --exclude <file>    Provide a bed/gff/text coordinate file of regions to skip
  --chrskip <regex>   Provide a regex for skipping unwanted chromosomes
						Example: "chrm|mt|random|chrun"

Duplication Settings:
  --frac <float>      Decimal fraction representing the target duplication 
						rate in the final file. 
  --max <int>         Integer representing the maximum number of alignments 
						at each position. Set to 1 to remove all duplicates.
  --optical           Enable optical duplicate checking
  --distance <int>    Set optical duplicate distance threshold.
						Use 100 for unpatterned flowcell (HiSeq) or 
						2500 for patterned flowcell (NovaSeq). Default 100.
						Setting this value automatically sets --optical.

Options:
  --out <file>        The output bam file containing unique and retained 
						duplicates; optional if you're just checking the 
						duplication rate.
  --mark              Write non-optical duplicate alignments to output marked 
						with flag bit 0x400.
  --report            Write duplicate distance report files only, no de-duplication
  --keepoptical       Keep optical duplicates in output as marked 
						duplicates with flag bit 0x400. Optical duplicates 
						are not differentiated from non-optical duplicates.
  --coord <string>    Provide the tile:X:Y integer 1-base positions in the 
						read name for optical checking. For Illumina CASAVA 1.8 
						7-element names, this is 5:6:7 (default)
  --seed <int>        Provide an integer to seed the random number generator, 
						making the subsampling reproducible between runs.

General:
  --cpu <int>         Specify the number of threads to use (4) 
  --verbose           Print more information
  --help              Print full documentation