This guide describes the steps used to run STAR and other bioinformatic tools on the Fastq files in GNomEx. All the commands listed in steps #2 to #6 are organized in a cmd.txt file, which is used to send jobs to the CHPC clusters following the Pysano reference manual. This file also includes shell scripts to parse sample names from FASTQ files and then read and write to different output files with that prefix.

1. Create reference

Download the mouse FASTA and GTF file from Ensembl release 94 and run STAR with the genomeGenerate option to create the reference database. The sjdbGTFfile option extracts splice junctions from the GTF file with a maximum possible overhang of 49 bases (for 50 bp reads).

wget ftp://ftp.ensembl.org/pub/release-94/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
wget ftp://ftp.ensembl.org/pub/release-94/gtf/mus_musculus/Mus_musculus.GRCm38.94.gtf.gz
gunzip *.gz

STAR --runMode genomeGenerate \
     --genomeDir star50 \
     --runThreadN 24 \
     --genomeFastaFiles Mus_musculus.GRCm38.dna.primary_assembly.fa \
     --sjdbGTFfile Mus_musculus.GRCm38.94.gtf \
     --sjdbOverhang 49

2. Trim adapters

Trim the Illumina adpater sequence using cutadapt version 1.16. The -O option starts trimming after 6 matching bases and -m option will discard trimmed reads shorter than 20 bases

cutadapt -j 24 -O 6 -m 20 \
 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
 -o 14176X1.fq 14176X1_170425_D00550_0428_ACB3GUANXX_4.txt.gz

3. Run FastQC

Run FastQC on the trimmed fastq file. This tool ouputs sequence length distributions, overrepresented sequences and many other statistics.

fastqc -f fastq 14176X1.fq

4. Align reads

Align the trimmed reads using STAR version 2.5.4a to the mouse reference in two pass mode and output a BAM file sorted by coordinates and unstranded bedGraph file. The quantMode option outputs alignments to transcript coordinates for RSEM.

STAR --genomeDir star50 \
     --runThreadN 24 \
     --readFilesIn 14176X1.fq \
     --twopassMode Basic \
     --outSAMtype BAM SortedByCoordinate \
     --quantMode TranscriptomeSAM \
     --outWigType bedGraph \
     --outWigStrand Unstranded

5. Count features

Rename the STAR output BAM file and run featureCounts version 1.5.1 to count uniquely aligned reads overlapping features in the GTF file.

mv Aligned.sortedByCoord.out.bam 14176X1.bam
featureCounts -T 24 -s 2 --largestOverlap -a Mus_musculus.GRCm38.94.gtf 14176X1.counts 14176X1.bam

STAR reports unique reads and reads that map to 2 to 10 locations on the genome. These multi-mapping reads can be assigned to features using RSEM, which returns the expected counts to transcripts and genes in 14176X1.isoforms.results or 14176X1.genes.results. These count tables are not used in the differential expression analyses unless requested.

6. Check quality

Run Picard CollectRnaSeqMetrics to count the number of reads matching exons, UTRs, introns and intergenic regions and to calculate the normalized gene coverage from the top 1000 expressed transcripts. Finally, run samtools idxstats to count the number of reads per chromosome.

7. Summarize alignments

MultiQC searches the Alignments directory for analysis logs and compiles an HTML report that includes interactive summary tables and plots for all commands in the cmd.txt file.

multiqc Alignments

The General Statistics table at the top of the report includes summaries from FastQC like total sequences (M Seqs), STAR (M Aligned), featureCounts (M Assigned) and collectRNASeqMetrics (% rRNA and mRNA). The remaining sections summarize outputs from each program.

8. View alignments

Load a BAM file into a database browser like IGV by clicking the URL link icon next to the BAM file name in GNomEx. Copy and paste the link into IGV by selecting “File, Load from URL”. Search for a gene or zoom into a specific region to check alignments. If reads do not align with annotations, make sure the correct reference assembly is selected (hg38, mm10, rn6, etc.). Also, for stranded Illumina single read sequencing runs, the reads in the BAM file will align in the opposite direction of the feature, so the -s 2 option in featureCounts was used to count reversely stranded reads.

To compare many samples, it’s easier to load the bigWig files in GNomEx. The STAR aligner normalizes coverage files by default, so the units are reads per million mapped reads (RPM). In addition, separate coverage files are created for unique reads (*.unique.bw) and unique plus multi-mapping reads (*.multiple.bw).

IGV displays a normalized coverage plot and RPM values at each position. If needed, multiply the RPM by the number of unique alignments from STAR in the MultiQC file to get the total reads at that base (for example, 24.45863 RPM * 21873674 uniquely aligned reads/1000000 = 535 total reads). Finally, select multiple tracks and right click to set the same minimun and maximum values to display in the y-axis range by selecting Group Autoscale.

9. Differential Expression

See the DESeq.html report for further details on the differential expression analysis using DESeq2 version 1.22.0.