Aligning RNA-seq data

The theory behind aligning RNA sequence data is essentially the same as discussed earlier in the book, with one caveat: RNA sequences do not contain introns. Gene models in Eukaryotes contain introns which are often spliced out during transcription. RNA Sequences that span two exons will have a hard time mapping to the genome, which still contains the code for introns.

One solution is to map the RNA sequence data to the predicted RNA molecules. However there are several disadvantages of mapping the RNA sequences directly to the predicted transcripts:

  1. We have to rely on the accuracy of the gene structure prediction method. Many genes structures are incomplete and also inaccurate.
  2. The genes we identify in the RNA samples may not be annotated in the genome yet. In this case, we would completely miss the gene from our analysis.
  3. Mapping the sequences to the genome can help us identify the genes that are missing from our annotation or annotated incorrectly.

Tophat

There are several softwares out now that have been developed specifically to help align RNA reads to the genome. One of the earliest and most popular is Tophat. Tophat is built on top of Bowtie ( another popular short read aligner aligned based on BWT ).

(Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009;25:1105–1111.)

Other RNA-seq aligners

There have been some newer RNA-seq aligners that are worth considering:

  1. HISAT: http://www.ccb.jhu.edu/software/hisat/index.shtml
  2. STAR: https://code.google.com/archive/p/rna-star/
  3. Kallisto: http://pachterlab.github.io/kallisto/manual.html
  4. Salmon: http://salmon.readthedocs.io/en/latest/salmon.html

Workflow using Tophat as an aligner

Let’s copy the data set into our scratch directory, unzip and untar the files.

Fastqc

First we will start with running the quality control for our sequences. We will use the software Fastqc, and to run this the command is very simple; just type fastqc and then the name of the file. Since there are only 4 files so we could simply run it 4 times, however let’s spend the extra time to create an array job which will look for all the fastq file in the directory and run fastqc. This will save us from significant amount of work in the future.

fastq.array.sh

Fastqc provides an .html file that can be opened in a browser and a .zip file that contains the html file and other information in a parsable text file. You may want to transfer these files to your computer to view them. Our array SLURM job also produced one .err and one .out file for each job that was executed.

A quick glance at the reports show us that :

  1. The sequence quality starts to decrease as we get closer to the 3′ end of the read
  2. Something happened to the tile which caused the middle portion to have a significant lower quality
  3. There are duplicated sequences which tend to be the Illumina adapters
  4. And there is also an over represented K-mers in the 5′
  5. The quality encoding is using Illumina 1.5, so when we run an analysis that will use quality scores, make sure it is using the correct encoding (phred64). For details see (https://en.wikipedia.org/wiki/FASTQ_format)

Trimmomatic

Trimmomatic has options to:

  • Remove leading and trailing nucleotide based on quality or simply a given number of bases
  • Remove sequence when the average quality of a window falls below a certain threshold
  • Remove sequences matching Illumina adapters.

To run Trimmomatic, we can use the same strategy as above.

trim.sh

Tophat

In order to align your RNA sequences to the genome with Tophat, you have to first create the database files using bowtie. bowtie2-build needs the fasta file as the first argument followed by the prefix to be used for the database index files.

bowtiebuild.sh

Then we can take each sample and align them against the Arabidopsis genome.

tophat.sh

You want to run this command separately for all the samples. The option to align paired reads is slightly different.

Now we can sort and de-duplicate the reads. Sorting and indexing are common transformations that allow applications to process the data efficiently. Picard has an excellent collection of tools that can be used. There is also samtools, which has some equivalent functions, however I find that I run into fewer errors when I use Picard. (https://broadinstitute.github.io/picard/)

dedup.sh

Now that we have the alignment files processed, we can use one many different tools to determine differentially expressed genes.