UMI Scripts - Usage
Home | Install | Usage | Applications |
Introduction
Universal Molecular Indexes, or UMIs, are for uniquely identifying the molecule from
which a sequencing read is derived, and is primarily used to differentiate reads from
unique biological molecules as opposed to a duplicate molecule derived from PCR
amplification or flow-cell spot duplication (optical duplicate). In essence, they are
oligomers of any nucleotide (NNNNN
) in the adapter oligo. UMI sequences may be
included inline with the primary sequencing read (usually at the beginning of the
read, either with or without an interposing fixed sequence), or read as a separate
sequencing read, similar to sample barcode index tags for de-multiplexing.
UMIs are common in single-cell sequencing. Since single-cell sequencing have their own workflows, this document deals with regular DNASeq and RNASeq alignments.
The UMI sequence typically doesn’t provide enough entropy to be used solely for de-duplication, so it must be used in conjunction with the read sequence. This is best done after the alignment step, since the alignment coordinates can then be used in combination with the UMI to determine uniqueness.
Preparing for alignment
Keeping the UMI during the alignment step is tricky. It can’t be part of the primary
read, since it won’t align to the genome, so it has to be associated through the use
of alignment tags. The current
SAM specification uses the RX
tag for unique molecular index sequences, along with QX
for UMI sequence quality.
Unfortunately, by default, the Fastq file format doesn’t allow for the use of extra metadata, so alternatives must be made.
-
Attach the UMI sequence to the read name. This makes it part of the read name and will be incorporated in the alignment. This is a non-standard method, but universally accepted by all aligners since they all keep the read name.
-
Convert the UMI to SAM formatted attribute tags and stick them in the Fastq read comment or description field, the part after the read name, delineated by a space or tab. Some aligners can incorporate this into the SAM output.
-
Convert the Fastq file into an unaligned Bam file, where the UMI is incorporated as a proper attribute tag. Some aligners can read directly from unaligned Bam files.
The method chosen is mostly dependent on the aligner software to be used, and to a lesser extent, the de-duplication software used.
UMIScripts tool example
On local HCI Linux machines or using our CHPC module repository, you will need to load the module.
module load umiscripts
Separate UMI Fastq file
For a separate Fastq file, one can use the merge_umi_fastq.pl tool. It can take multiple Fastq files and sort them appropriately as Read1, Read2, and UMI based on name and size, or they can be explicitly set with appropriate command line options. The output file name or base name can optionally be provided.
-
By default, it will generate SAM tags in the Fastq comment.
merge_umi_fastq.pl *.fastq.gz
-
To append the UMI sequence to the read name, use the
--name
option.merge_umi_fastq.pl --name *.fastq.gz
-
To write as an unaligned Bam file, use the
--bam
option. A copy of samtools must be in thePATH
.merge_umi_fastq.pl --bam *.fastq.gz
The application can process 1 million paired-end reads in about 15-20 seconds, but this is entirely dependent on the machine. Output may be streamed to standard out if desired.
Embedded UMI Fastq file
When the UMI is embedded inline to the read, it must be extracted. It may be in the first read, second read, or both. The embedded_UMI_extractor.pl application can do this. It assumes that the UMI sequence is at the 5’ end of the read, separated by a fixed sequence. For example,
NNNNNNNNNNNNGATCxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
where N
is the UMI sequence, GATC
is the fixed sequence, and xxx
represents the
sequence read. An example command, with UMI length of 12 and fixed sequence of GATC
,
would then be
embedded_UMI_extractor.pl --read1 R1.fastq.gz --length 12 --fixed GATC
Paired-end reads can also be handled. The UMI by default is assumed to be on both reads,
otherwise it can be specified using the --umi
option with values 1
, 2
, or 12
(for both).
embedded_UMI_extractor.pl --read1 R1.fastq.gz --read2 R2.fastq.gz --umi 12 --length 12 --fixed GATC
By default, SAM tags are generated in the Fastq comment. Alternatively, they can be
appended to the name using the --name
flag, or an unaligned bam file can be written
using the --bam
option.
Aligners
The choice of aligner dictates the method used. Below are examples for the four common aligners used in the Core.
Output is piped into samtools fixmate
to append the mate score, which can be used
for ranking alignments during de-duplication later on.
BWA
BWA will accept UMI in the read name, as a Fastq comment, and as an unaligned Bam file with an extra step.
To align Fastq with SAM tag comments:
bwa mem -t $CPU -C reference.fasta *.fastq.gz | \
samtools fixmate -m - output.bam
To align with an unaligned Bam file, it must first be converted to Fastq with tags
as comments using samtools
as a pre-step:
samtools fastq -T RX,RQ unaligned.bam | \
bwa mem -t $CPU -C -p reference.fasta - | \
samtools fixmate -m - output.bam
STAR
STAR does not accept Fastq comments, but it will accept an unaligned Bam file, as of version 2.6. It will also of course accept UMI in the read name. You must designate the command for reading the input file, and designate single-end “SE” or paired-end “PE”.
STAR --runMode alignReads \
--readFilesIn unaligned.bam \
--readFilesCommand samtools view \
--readFilesType SAM PE \
--readFilesSAMattrKeep RX QX \
...
STAR will also accept a SAM text file as an alternative to a Bam file, in which
case the --readFilesCommand
should be changed as appropriate.
Bowtie2
Bowtie2 will accept unaligned Bam files as of version 2.3, and Fastq comments as of version 2.4. It will of course accept UMI in the name.
To align with unaligned Bam files:
bowtie2 --preserve-tags --align-paired-reads \
-x reference.fasta -b unaligned.bam | \
samtools fixmate -m - output.bam
To align with Fastq with SAM tag comments:
bowtie2 --sam-append-comment \
-x reference.fasta -1 read1.fastq.gz -2 read2.fastq.gz | \
samtools fixmate -m - output.bam
Novoalign
Novoalign appears to have the most versatility, with the ability to handle any format, including extracting the UMI appended to the read name, or even extracting an embedded UMI sequences from the main sequence read.
To extract UMI appended to the read name:
novoalign -d index.nix -f *.fastq.gz \
--umi2 RX --tune NOVASEQ -o SAM | \
samtools fixmate -m - output.bam
To align with Fastq SAM comments:
novoalign -d index.nix -f *.fastq.gz \
-C --tune NOVASEQ -o SAM | \
samtools fixmate -m - output.bam
To align an unaligned Bam file:
novoalign -d index.nix -f unaligned.bam \
-F BAM RX,QX --tune NOVASEQ -o SAM | \
samtools fixmate -m - output.bam
De-duplication of alignment files
Once you have alignment files, they may be de-duplicated. The Bam files must be sorted and indexed. Duplicates at any given coordinate are checked for the UMI and the best one is kept, while the remaining are marked or discarded. There are two programs that can be used.
Picard
The Picard tool
UmiAwareMarkDuplicatesWithMateCigar
is the most complete and thorough application, at the great expense of speed, i.e. it
is painfully slow! It can only accommodate UMI sequences in a SAM attribute tag,
RX
by default.
java -jar picard.jar UmiAwareMarkDuplicatesWithMateCigar \
-I input.bam -O output.bam \
-M marked_dup_metrics.txt -UMI_METRICS umi_metrics.txt \
--OPTICAL_DUPLICATE_PIXEL_DISTANCE 2500 --UMI_TAG_NAME RX
Samtools
As of version 1.16 (released in 2022), samtools
added the capability to use UMIs in the markdup
function to mark (or remove)
duplicates. It will extract the UMI sequence from a SAM attribute tag (--barcode-tag
)
or from the name (--barcode-name
). No mismatches in the UMI are tolerated.
samtools markdup --barcode-tag RX -d 2500 input.bam output.bam
bam_umi_dedup
The included bam_umi_dedup.pl application is considerably faster than Picard, but with a notable caveat: no guarantee is made for retaining identical molecules at secondary, supplementary, and inter-chromosomal alignments (they are treated independently). Otherwise, for normal alignments, results are comparable to other tools. For most count-based applications such as ChIPSeq or RNASeq, this limitation may be acceptable.
By default, duplicates are discarded, or they can be marked with the --mark
option.
Unmapped alignments are silently discarded (an unfortunate side effect of
multi-threading by reference target sequences).
For de-duplication with SAM attribute tag RX
(default):
bam_umi_dedup.pl --in input.bam --out output.bam --cpu $CPU
For de-duplication with the UMI appended to the read name:
bam_umi_dedup.pl --in input.bam --out output.bam --umi name --cpu $CPU
One mismatch (by default) is tolerated when de-duplicating with UMI sequences. This
is an advantage over the other tools which do not tolerate mismatches. Note that
extreme depth may substantially slow down execution time when mismatches are
tolerated. For bam_umi_dedup
, a maximum depth is allowed for mismatch tolerance
before mismatch checking is dropped to avoid extreme impact (runtime of days!?).
Dropping mismatch tolerance completely will improve runtime.
Performance
For a human (hg38) WGS Bam file with 668 M alignments, the Picard tool (version 2.26.3)
completes in 7.5 hours, while the bam_umi_dedup
tool completes in 48 minutes with 16
threads and one mismatch tolerated. The Samtools tool was not tested with this file, but
subsequent testing suggests it is comparable to the bam_umi_dedup
but requiring fewer
threads, i.e. it’s more performant. Results may vary.