DESeq 2

The Dataset

DESeq2 manual

Our goal for this experiment is to determine which Arabidopsis thaliana genes respond to nitrate. The dataset is a simple experiment where RNA is extracted from roots of independent plants and then sequenced. Two plants were treated with the control (KCl) and two samples were treated with Nitrate (KNO3).

The sequences were processed to remove all low quality sequences, trim all low quality nucleotides, and finally aligned against the Arabidopsis genome using TopHat. Our analysis starts from here.

Download from Google Drive

We have been provided the following files:

  1. 4 Bam files – An alignment file, one for each sample
  2. Arabidopsis.gtf file – which contains information about the genes in Arabidopsis and where they are located in the genome.
  3. Experiment design – a comma separated file containing meta data.
  4. Gene description – Description about the function of the genes in Arabidopsis.

If you are doing this for the first time, you will have to install some packages first.

If you are on Dalma, please see special instructions on how to install the packages below.

The Data Files

We will use the result from the previous Aligning RNA-seq section.

It’s a good idea to start R from within the directory where the files are located.

BAM files are provided in this folder just in case you are not able to finish the alignment in time

Running R in interactive mode and modules to load.

###

The Alignment Files

The alignment files are in bam format. This files will not be loaded into R, but rather simply pointed to by a reference/variable. The alignment files provided are about 15x smaller compared to an average RNA-seq sample run today. Also there will be triplicates of 3 or more different conditions resulting in much more than 4 sample. So you can imagine the amount of space and memory R would need if all that data was stored in the workspace.

To do point to the bam files we will use Rsamtools

The Annotation File

GTF file is very similar to a GFF file. Their purpose is to store the location of genome features, such as genes, exons, and CDS. It is a tab delimited flat file which looks something like this.

We will load the GTF file using GenomicFeatures and group the exons based on the gene annotation.

The Experimental Design

We will need meta information about our samples to define which sample is the KCL and which is the KNO3. A simple comma separated file can be created using a text editor where the first column has the sample file names and the second column has the categories for the samples. The file should look something like this:

Now let’s load the file into our workspace. The commands read.table, read.csv, and read.delim are all related with the exception of different default parameters. They all read a text file and load it into your workspace as a data.frame. Additonally it loads all columns that are character vectors and makes them factors. The first option for all the different versions is the neame of the file. After that, unless you have memorized the options for the commands, explicitly state which parameters you are providing.

Counting the Reads

Now we will use the GenomicAlignments package to determine where the reads are mapping. Remember we decided to map the sequences to the genome so we can make new discoveries such as novel genes, alternative splicing, and also anti-sense transcripts. In the event that the reference genome is not sequenced, one would have to assemble the RNA-seq reads first to identify all the genes that were detected in the RNA-seq samples. However assembling the transcriptome is quite a complicated process and requires a lot of time and manual curation to produce quality transcripts. We are fortunate that we are using Arabidopsis data whose genome was sequenced in 2000.

The function summarizeOverlaps takes the bam files and the gene annotation to count the number of reads that are matching a gene. The union mode means if a read matches an area where two genes overlap, then it is not counted. We can also provide other parameters such as whether sequence was single-end or pair-end.

Filtering the Counts

Now, if you remember from the lecture, genes that are expressed at a very low level are extremely unreliable. We will remove all genes if neither of the groups ( KCL or KNO3 ) have a median count of 10 and call the new dataframe counts_filtered.

Differentially Expressed Genes

Now that we have the counts table filtered, we can start to determine if any of the genes are significantly differentially expressed using DESeq. DESeq performs a pairwise differential expression test by creating a negative binomial model.

Now we can create an object that DESeq needs using the function newCountDataSet. In order to create this dataset, we need the filtered data frame of read counts and the factor that will help group the data based on the condition.

Before the actual test, DESeq has to consider the difference in total reads from the different samples. This is done by using estimateSizeFactors function.

Next DESeq will estimate the dispersion ( or variation ) of the data. If there are no replicates, DESeq can manage to create a theoretical dispersion but this is not ideal.

The plot shows us that as the gene’s read count increases, dispersion decreases, which is what we expect. Now we will tell DESeq what we would like to compare. Then we will use the adjusted p-value ( p-value corrected for multiple hypothesis testing ) for our cutoff.

The sum command is normally used to add numberic values. However in logic vectors, TRUE=1 and FALSE=0, so we can use the sum function to count the number of “TRUE” in the vector. In the example we counted the number of genes that have a an adjusted p-value less than 0.05.

MA Plot

Here’s an MA plot that shows as the average count of a gene increases, a smaller fold change is needed for something to be significant. For this reason, it is often helpful to require that the log2foldchange also be greater than or less than negative of some cutoff.

Significant genes

Let’s use the same values for our cutoff to determine which genes we want to consider as significantly differentially expressed. The resSigind variable will contain genes that are induced and resSigrep will contain genes that are repressed when the plants are treated with Nitrate. To create one dataframe of differentially expressed genes, let’s combine the two dataframe. We can use the rbind command because the columns are the same in both sets. To show the name of the genes, simply look in the id column of the dataframe.

Gene Annotations

Great ! We have genes that are differentially expressed, what do we know about these genes ? The gene identifier we obtained from the GTF file is referred to as TAIR identifiers (a consortium that used to release Arabidopsis genome annotations) I managed to download the gene description for all the genes. Let’s load them into the workspace and find out what are the names of the genes.Since the set of repressed genes is smaller, let’s see what we can find out about them.

Before we do this, note that the identified in the gene_description file is slightly different ( the file contains the transcript identifier that ends with “.” and a number), Let’s replace every occurrence of . and a number with nothing. Then we will be able to use match to find where our gene is in the description file so we can only print out that row.

GO-term enrichment analysis

Getting the gene names are informative but it is hard to determine whether a speicifc category of terms are over-represented in this analysis. So we will perform a GO-term enrichment analysis of all differentially expressed genes using a Hypergeometric test to determine if there is any GO-term that is enriched in this list.

To run the test using the GOstats package, we must first create a GOHyperGParams object as shown below. Here we will provided the genes in the list, the universe of all genes, which GOterm gene space we are interested in, a p-value cutoff, and if we want an additional conditional test. The conditional test looks at parent and child relationship of the GO-terms. If the parent term and the child term are over-represented, the parent-term has to be significantly more over-represented to show up in the list of over-represented GO-terms. This avoids the problem where many over-represented GO-terms are parent or child of another and the terms appear to be very redundant.

To get actually perform the test we use the hyperGTest function and to see the results simply use the summary function.

Creating Heatmaps

Clustering is a powerful way of grouping together genes based on similarity of their expression profiles. Here we
will use an enhanced version of the heatmap function that has built-in clustering method.

To use heatmap.2 we need to install gplots.
Please note that gplots is not a part of Bioconductor so you have to install it the standard way by using

By default the built-in clustering method uses Euclidean distance to determine similarity of genes instead of pearson correlation and uses the complete linkage method instead of average. Luckily heatmap.2 allows us to redefine the distance and hclust functions by providing our own. Below are two functions that allow us to make the adjustment.

DESeq also provides a normalized version of the counts data which will become useful when performing advanced transcriptome analysis such as clustering. So let’s retrieve the normalized values and then create a new dataframe with expression values of just the differentially expressed genes.

Now we will run the heatmap.2 function to create heatmap of the differentially expressed genes.

The heatmap shows two distint groups, a larger one going up, and a smaller one going down.

#Clustering

Now id we wanted to get the gene names for the two clusters we can perform the clustering on our own and save them in different vectors.

The hclust function is performing the heirarchichal clustering. Note that we have to first transpose the normalized matrix before we perform the correlation since it does this column by column. We then convert the correlation value to a distance measure by subtracting it from 1 and finally we tell R to make it into a distance object.

cutree is deciding where to cut the tree. We can use the k option, which allows you to specific the total number of final groups you want or the h option which allows you to pick a distance cutoff. The result is simply a vector of group identifiers (either 1 or 2) and then we simply identify which genes are in which cluster group.

Now you can do your Go-term enrichment on each group !