Pheniqs

What is Pheniqs and why do you need it?

Pheniqs is a fast and accurate high-throughput sequencing barcode classifier.
  • It can convert sequence reads to and from FASTQ, SAM, BAM and CRAM formats, decode barcodes and manipulate read layout in a single process.
  • It is very accurate, much more so than traditional minimum distance based methods that simply compare the number of mismatches between two sequences. More specifically, it has a lower False Discovery Rate and, at the same time, a much lower Miss Rate. In simple terms it misclassifies fewer reads, introduces less noise and at the same time recovers more reads.
This plot compares false discovery rates and miss rates between Pheniqs and traditional minimum distance decoding. The X axis shows increasing average error rates. In the error range relevant to most Illumina runs, between 1 in 10000 and 1 in 1000, Pheniqs has a lower false discovery rate than MDD while still maintaining a much lower miss rate.

  • It is very fast. A benchmark decoding 94 standard Illumina dual indexed barcodes in more than 11.5 billion reads on a CPU with 14 cores completed in just over 3 hours. For comparison, deML, another statistical classifier that is still much faster than Python implementations, took almost 80 hours to complete the same task.
  • It can handle complex barcode designs with a flexible syntax that can address any number of barcodes anywhere on the read. That means no need for pre or post processing.
In this diagram you can see how Pheniqs uses a syntax similar to Python array slicing to address sample, cellular and molecular barcodes on a read made of two biological segments and two index segments, and produce SAM output.

Installing with conda

Pheniqs can be installed in many ways. You can install it with conda, you can use the pheniqs-build-api to build your own portable binary, or you can build it from source. See the Pheniqs documentation for details.
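
For reference, installing with conda usually looks something like this (a sketch; at the time of writing Pheniqs is distributed through the bioconda channel, but check the installation page for the current channel and version):

conda install -c bioconda pheniqs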

But the good people at the HPC have created a module for us on Greene, so we will use that for this demo because it takes exactly two seconds to use 🙂

module load pheniqs/2.1.0

Example files

In this tutorial we will be using a small toy example with 2500 reads. The reads are paired-end, single indexed and were produced by a MiSeq instrument: two biological segments that are 51 nucleotides long and a single index that is 8 nucleotides long. You are probably not used to seeing Illumina output this way, since samples are often demultiplexed by the core facility, but the raw output from the sequencer includes each segment in a gzip compressed FASTQ file with full read quality data.

You can get the three gzip compressed FASTQ files here: BDGGG_s01.fastq.gz, BDGGG_s02.fastq.gz, BDGGG_s03.fastq.gz.

Format conversions and interleaving

  • Both input and output can be FASTQ or SAM/BAM/CRAM, in either split or interleaved layout.
  • Input format is automatically detected from file content, not from the extension, so files without an extension or even with a wrong extension are fine (though still not good practice).
  • Output format defaults to uncompressed SAM on stdout.
  • Output format is determined by the output file extension: fastq, fastq.gz (gzip compressed FASTQ), sam, bam or cram.
  • When output is -, or more explicitly /dev/stdout, you must specify the output format with a query parameter: -?format=fastq, -?format=fastq&compression=gz (gzip compressed FASTQ), -?format=sam (not usually necessary since this is the default), -?format=bam, -?format=cram.
  • You can also control compression parameters in a similar fashion, see the manual for details.
Let's see some basic commands with Pheniqs.

  • View a gzip compressed FASTQ file as SAM on stdout
pheniqs mux --input BDGGG_s01.fastq.gz|less
@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --input BDGGG_s01.fastq.gz       VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   76      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
Currently Pheniqs has only one subcommand, called mux, but there will be more soon. The only argument we use here is -i/--input. Most command line arguments have both a short form and a long form. The short form is easier to use when typing interactively, but the long form is better when documenting things because it makes the command easier to read. Short form arguments can be aggregated, for instance -s -i input.fastq is the same as -si input.fastq. This is all just standard POSIX convention, which Pheniqs tries to adhere to as much as possible.
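
For instance, these two invocations are equivalent:

pheniqs mux --input BDGGG_s01.fastq.gz
pheniqs mux -i BDGGG_s01.fastq.gz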

  • View a gzip compressed FASTQ file as FASTQ on stdout. Here you can see what you can expect to find in those 3 files.
pheniqs mux --input BDGGG_s01.fastq.gz --output "-?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
pheniqs mux --input BDGGG_s02.fastq.gz --output "-?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GGACTCCT
+
B@CCCFC<
pheniqs mux --input BDGGG_s03.fastq.gz --output "-?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
  • Interleave the 3 gzip compressed FASTQ files into one interleaved uncompressed FASTQ file on stdout
pheniqs mux --input BDGGG_s01.fastq.gz --input BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output "/dev/stdout?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 2:N:0:
GGACTCCT
+
B@CCCFC<
@M02455:162:000000000-BDGGG:1:1101:10000:10630 3:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
-i/--input can show up multiple times on the command line. Notice that the order matters…

pheniqs mux --input BDGGG_s01.fastq.gz --input BDGGG_s03.fastq.gz -i BDGGG_s02.fastq.gz --output "/dev/stdout?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 2:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 3:N:0:
GGACTCCT
+
B@CCCFC<
  • Interleave the 3 gzip compressed FASTQ files into one interleaved SAM output on stdout
pheniqs mux --input BDGGG_s01.fastq.gz --input BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output "/dev/stdout?format=sam"|less
@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --input BDGGG_s01.fastq.gz --input BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output /dev/stdout?format=sam      VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     FI:i:1  TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630   13      *       0       0       *       *       0       0       GGACTCCT        B@CCCFC<        FI:i:2  TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     FI:i:3  TC:i:3
  • OK, now let's interleave the 3 gzip compressed FASTQ files into one interleaved CRAM formatted file
pheniqs mux -i BDGGG_s01.fastq.gz -i BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output BDGGG.cram

When done, Pheniqs will give you a JSON formatted report on stderr. The report provides statistics on both input and output, as well as separate statistics for overall reads and for reads passing quality control (Pass Filter, or PF, reads). The documentation on the website describes what each variable means. You can also redirect the report to a file with -R/--report. Since no sample classification was requested, all reads were directed to the "unclassified" read group.
{
    "incoming": {
        "count": 2500,
        "pf count": 2430,
        "pf fraction": 0.972
    },
    "sample": {
        "classified count": 0,
        "classified fraction": 0.0,
        "classified pf fraction": 0.0,
        "count": 2500,
        "index": 0,
        "pf classified count": 0,
        "pf classified fraction": 0.0,
        "pf count": 2430,
        "pf fraction": 0.972,
        "unclassified": {
            "ID": "undetermined",
            "PU": "undetermined",
            "count": 2500,
            "index": 0,
            "pf count": 2430,
            "pf fraction": 0.972,
            "pf pooled fraction": 1.0,
            "pooled fraction": 1.0
        }
    }
}
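If you prefer to capture the report in a file instead of reading it off stderr, the same run can be written with an explicit report path (a sketch using the -R/--report option mentioned above; BDGGG_report.json is just an arbitrary file name):

pheniqs mux -i BDGGG_s01.fastq.gz -i BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output BDGGG.cram --report BDGGG_report.json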
To inspect your CRAM file you can, again, use Pheniqs. Essentially this is a format conversion from CRAM to SAM… CRAM is the state-of-the-art compressed successor of BAM, basically a super compressed SAM file. If you want to archive your data this can save you a LOT of space.

pheniqs mux -i BDGGG.cram|less
@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux -i BDGGG.cram    VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   76      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
M02455:162:000000000-BDGGG:1:1101:10000:10630   76      *       0       0       *       *       0       0       GGACTCCT        B@CCCFC<
M02455:162:000000000-BDGGG:1:1101:10000:10630   76      *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
Or as interleaved FASTQ:

pheniqs mux -i BDGGG.cram -o "-?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GGACTCCT
+
B@CCCFC<
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
Oh wait! This is probably not what you expected?! Pheniqs thinks every one of the records in your file is a new read, when in fact every three records are the 3 segments of the same read. To give you absolute control, Pheniqs, by default, expects you to explicitly describe the interleaving layout of the input file. You can use the -s/--sense-input flag to ask Pheniqs to try and guess the interleaving layout of the file.

pheniqs mux -si BDGGG.cram|less
@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux -si BDGGG.cram   VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     FI:i:1  TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630   13      *       0       0       *       *       0       0       GGACTCCT        B@CCCFC<        FI:i:2  TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     FI:i:3  TC:i:3
Ah, this makes more sense… Here it is as interleaved FASTQ:

pheniqs mux -si BDGGG.cram -o "-?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 2:N:0:
GGACTCCT
+
B@CCCFC<
@M02455:162:000000000-BDGGG:1:1101:10000:10630 3:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
Here are the sizes of the files in the different formats.
size    type
959K    fastq
177K    fastq.gz
1.1M    sam
177K    bam
113K    cram
You can see how valuable compression can be. BAM and CRAM files not only provide better compression, they can also be read and written faster, and the differences become more pronounced with bigger files. Since Pheniqs can convert between the formats in memory, you can store data on disk in a CRAM file and use Pheniqs to feed FASTQ to tools that accept FASTQ from standard input.
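
For example, a downstream tool that reads FASTQ on standard input could be fed straight from the CRAM archive (a sketch; some_tool is a placeholder for your aligner or QC tool of choice):

pheniqs mux -si BDGGG.cram --output "-?format=fastq" | some_tool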

Read manipulation

Pheniqs can also parse and manipulate the content of the read. It uses a syntax similar to Python array slicing to tokenize every read segment and assemble new segments from the tokens. The syntax for manipulating the read structure is too complex to provide on a command line, so you will need a JSON config file.

Let’s start with a simple theoretical example. Suppose the first 8 nucleotides on the first segment are actually a cellular barcode and you want to separate them into a different segment. We create a configuration file that describes the layout:
{
  "template": {
    "transform": {
      "token": [ "0:8:", "1:0:", "0:0:8", "2:0:" ]
    }
  }
}
pheniqs mux --config manipulate.json -si BDGGG.cram |less
M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       TAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     FGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     FI:i:1  TC:i:4
M02455:162:000000000-BDGGG:1:1101:10000:10630   13      *       0       0       *       *       0       0       GGACTCCT        B@CCCFC<        FI:i:2  TC:i:4
M02455:162:000000000-BDGGG:1:1101:10000:10630   13      *       0       0       *       *       0       0       CTAAGAAA        CCCCCGGG        FI:i:3  TC:i:4
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     FI:i:4  TC:i:4
We used the CRAM file as input since it's faster to read and it's also easier to feed Pheniqs with an interleaved file.

Since Pheniqs follows the Python array slicing syntax it uses a zero based coordinate system. That means coordinates start from zero, and a token is inclusive of its start coordinate and exclusive of its end coordinate. It might sound complicated but it actually makes the arithmetic very easy. For instance, the first 3 nucleotides of the first read segment are "0:0:3".

A token is made of 3 colon separated integers. The first is the mandatory zero based input segment index. The second is an inclusive start coordinate and defaults to 0 if omitted. The third is an exclusive end coordinate; if it is omitted the token spans to the end of the segment. The start and end coordinates can take positive or negative values to access the segment from either the 5' (left) or 3' (right) end, mimicking the Python array slicing syntax. The two colons are always mandatory.
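
A few hypothetical token patterns and what they select, following the Python slicing convention described above:

"0:0:3"    the first 3 cycles of input segment 0
"0:3:"     cycles 3 to the end of input segment 0
"1::8"     the first 8 cycles of input segment 1
"0:-8:"    the last 8 cycles of input segment 0, counted from the 3' end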

The template configuration element controls what the output read will look like. In this case we asked for 4 output segments. The first output segment is "0:8:", or cycles 8 to the end of input segment 0. Pheniqs can validate your config file and describe, in words, what the conversion from input to output will be. Here is the description given by the --validate flag:

pheniqs mux --config manipulate.json -si BDGGG.cram --validate
Output transform

    Output segment cardinality                  4

    Token No.0
        Length        variable
        Pattern       0:8:
        Description   cycles 8 to end of input segment 0

    Token No.1
        Length        variable
        Pattern       1::
        Description   cycles 0 to end of input segment 1

    Token No.2
        Length        8
        Pattern       0::8
        Description   cycles 0 to 8 of input segment 0

    Token No.3
        Length        variable
        Pattern       2::
        Description   cycles 0 to end of input segment 2

    Assembly instruction
        Append token 0 of input segment 0 to output segment 0
        Append token 1 of input segment 1 to output segment 1
        Append token 2 of input segment 0 to output segment 2
        Append token 3 of input segment 2 to output segment 3

Classifying sample barcodes

Now assume that the 8 nucleotides of the second of the three segments in the input read are actually the sample barcode. That would be the standard Illumina protocol for a single indexed paired-end run. We want to classify the reads by that barcode. If sequencers never made mistakes and reads were perfect this would be a very easy task, since you would just need to compare two strings of 8 characters. In reality sequencers make mistakes, between 1 in 1000 and 1 in 10000 on most Illumina sequencers.

Most tools that classify barcodes ignore the quality scores the sequencer provides and simply look for either a perfect string match or at most 1 mismatch, if the barcodes in your set are different enough from each other to tolerate the mismatch. This method is called minimum distance decoding, abbreviated MDD, and Pheniqs can do that if you insist: you just set the decoder property "algorithm": "mdd".
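
For instance, in the sample decoder shown further down, only the algorithm property would change (a sketch, where "..." stands for the rest of the decoder, which stays the same):

    "sample": {
        "algorithm": "mdd",
        ...
    }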

But Pheniqs can do something much smarter: it can compute the posterior probability that the correct barcode was decoded, given the list of possible barcodes (see the paper for all the mathematical nitty gritty), and make decisions based on that. It can even report that probability for every read in a standard SAM auxiliary field. Think of it as a quality score for decoding the barcode, except it's not just a quality score, it's an actual probability! If you had two independent barcodes (for instance a sample and a cellular barcode), the probability that the entire classification is correct is simply the product of the two. We call this a Phred Adjusted Maximum Likelihood Decoder, abbreviated PAMLD.

PAMLD produces far fewer false negatives than MDD. You can expect it to recover a lot of the reads that MDD misses; how many depends on how confident you want Pheniqs to be, which you specify with the confidence threshold parameter. PAMLD does another cool thing: it filters noise. It can tell when a read is more likely to be noise than an actual read, for instance a PhiX read that you spiked into the mixture with your pooled libraries, or some human DNA that found its way into your sample (how did that get here?!). Benchmarks on simulated data tell us that it is VERY good at picking up noise.

Let’s look at a config file for decoding the sample barcode.
{
    "input": [ "BDGGG.cram", "BDGGG.cram", "BDGGG.cram" ],
    "template": {
        "transform": {
            "token": [
                "0::",
                "2::"
            ]
        }
    },
    "sample": {
        "algorithm": "pamld",
        "confidence threshold": 0.99,
        "noise": 0.01,
        "transform": {
            "token": [ "1::8" ]
        },
        "codec": {
            "@AGGCAGAA": {
                "LB": "trinidad 5",
                "barcode": [ "AGGCAGAA" ],
                "concentration": 1
            },
            "@CGTACTAG": {
                "LB": "trinidad 4",
                "barcode": [ "CGTACTAG" ],
                "concentration": 1
            },
            "@GGACTCCT": {
                "LB": "trinidad 9",
                "barcode": [ "GGACTCCT" ],
                "concentration": 1
            },
            "@TAAGGCGA": {
                "LB": "trinidad 1",
                "barcode": [ "TAAGGCGA" ],
                "concentration": 1
            },
            "@TCCTGAGC": {
                "LB": "trinidad 8",
                "barcode": [ "TCCTGAGC" ],
                "concentration": 1
            }
        }
    }
}
Since we don't want the sample barcode to be part of the output read, we skip that segment in the template transform; we only want the first and last segments in the output.

Next we declare a sample instruction in the config file. The transform in the sample instruction tells Pheniqs the barcode is on the first 8 nucleotides of the second input segment. The codec section is where you specify details about the barcodes you expect to find in your data. You naturally specify the barcode sequence, or multiple sequences if, for instance, you use dual indexing, but you can really have any number of segments in your barcodes, as long as their length and structure match what you specified in the transform. Pheniqs has good error handling: if your config file has an error it will tell you exactly what went wrong.

For instance if in the first barcode you only specified 7 nucleotides:
    "@AGGCAGAA": {
        "LB": "trinidad 5",
        "barcode": [ "AGGCAGA" ],
        "concentration": 1
    },
Configuration error : sample decoder : expected 8 but found 7 nucleotides in segment 0 of barcode @AGGCAGAA
or if you copy/pasted your barcodes wrong and specified the same sequence twice:
    "@AGGCAGAA": {
        "LB": "trinidad 5",
        "barcode": [ "CGTACTAG" ],
        "concentration": 1
    },
    "@CGTACTAG": {
        "LB": "trinidad 4",
        "barcode": [ "CGTACTAG" ],
        "concentration": 1
    },
Configuration error : sample decoder : duplicate barcode sequence CGTACTAG
We specified the input in the config file instead of on the command line, so we don't have to keep typing it. We also specified how to parse the input explicitly, instead of using --sense-input, by listing the interleaved input file 3 times, telling Pheniqs to pull three records for every read, i.e. that it should expect to find 3 consecutive segments for every read in the file. Being able to explicitly specify how to parse the input allows you to deal with malformed files (let's face it, bioinformatics isn't perfect).

Notice we specified a 0.99 confidence threshold, so any barcode with a posterior decoding probability lower than that will be marked as failing quality control and moved to the unclassified (or undetermined) group. confidence threshold defaults to 0.95, but it represents a tradeoff between false positives and false negatives, and the ideal value depends on your data, your barcode set and, frankly, what you want. We found 0.95 to be a fair tradeoff, but if you prefer to lose more reads and be super confident about the ones you keep you can set a higher value, and if you feel that your downstream analysis will be able to detect false positives and really want to recover as many reads as possible you can set a lower value.

PAMLD's Bayesian model depends on a set of prior probabilities: basically how much noise (the noise variable) and how much of each sample (the concentration variable in each barcode) you expect to find in your data. noise defaults to 0.01, which is 1%, the amount of PhiX that textbook Illumina runs have spiked in, but if you are doing anything other than a vanilla Illumina protocol you probably have more. If you leave out the concentration they all default to 1, essentially forming a uniform distribution. Pheniqs will normalize whatever you specify in concentration so that the priors sum up to 1 - noise. For instance, if you specify 1 for all your barcodes and 0.01 for noise and you have 5 samples, each will be assumed to have a 0.99 / 5 = 0.198 prior. What we have in the above config file could have just been left out, since it's the default.

If you have a good guess you can specify your priors in the config. We will see later that Pheniqs can give you a pretty good estimate for them.

First let's see how the output looks:
pheniqs mux --config decode.json |less
@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@RG     ID:AGGCAGAA     BC:AGGCAGAA     LB:trinidad 5   PU:AGGCAGAA
@RG     ID:CGTACTAG     BC:CGTACTAG     LB:trinidad 4   PU:CGTACTAG
@RG     ID:GGACTCCT     BC:GGACTCCT     LB:trinidad 9   PU:GGACTCCT
@RG     ID:TAAGGCGA     BC:TAAGGCGA     LB:trinidad 1   PU:TAAGGCGA
@RG     ID:TCCTGAGC     BC:TCCTGAGC     LB:trinidad 8   PU:TCCTGAGC
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --config decode.json     VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     RG:Z:GGACTCCT   BC:Z:GGACTCCT   QT:Z:B@CCCFC<   XB:f:1.0616e-06
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     RG:Z:GGACTCCT   BC:Z:GGACTCCT   QT:Z:B@CCCFC<   XB:f:1.0616e-06
M02455:162:000000000-BDGGG:1:1101:10005:7907    589     *       0       0       *       *       0       0       CCGTTTGAATGTTGACGGGATGAACATAATAAGCAATGACGGCAGCAATAA     CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG     RG:Z:undetermined       BC:Z:ATGTTTAT   QT:Z:@----6,,
M02455:162:000000000-BDGGG:1:1101:10005:7907    653     *       0       0       *       *       0       0       GAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCG     CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG     RG:Z:undetermined       BC:Z:ATGTTTAT   QT:Z:@----6,,
Pheniqs has declared a read group for every sample barcode and one for the undetermined reads. The BC and QT tags hold the uncorrected sample barcode and its quality scores. RG is a reference to the Read Group ID declared in the header. Those are all standard tags; you will find them in the SAM specification. XB is the special Pheniqs sauce: it is the probability that the sample barcode decoding was incorrect, the error probability. In the second read, decoding failed (remember, this is an interleaved file, so every two records are one read; you can see they have the same read ID). BC and QT are there, but RG points to undetermined and there is no XB. You will also notice the flag field is different, since the bit marking this read as failing QC is set.

You can add Read Group metadata to your config file and it will end up where you expect it to. You already saw an example of that with the LB tag. For instance, here we add some global fields so they are applied to all Read Groups:
{
    "CN": "CGSB",
    "DT": "2018-02-25T07:00:00+00:00",
    "PI": "300",
    "PL": "ILLUMINA",
    "PM": "miseq",
    "SM": "trinidad",
    "input": [ "BDGGG.cram", "BDGGG.cram", "BDGGG.cram" ],
    .
    .
    .
}
and we get a richer header:
@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined CN:CGSB DT:2018-02-25T07:00:00+00:00    PI:300  PL:ILLUMINA     PM:miseq        PU:undetermined SM:trinidad
@RG     ID:AGGCAGAA     BC:AGGCAGAA     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 5   PI:300  PL:ILLUMINA     PM:miseq        PU:AGGCAGAA     SM:trinidad
@RG     ID:CGTACTAG     BC:CGTACTAG     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 4   PI:300  PL:ILLUMINA     PM:miseq        PU:CGTACTAG     SM:trinidad
@RG     ID:GGACTCCT     BC:GGACTCCT     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 9   PI:300  PL:ILLUMINA     PM:miseq        PU:GGACTCCT     SM:trinidad
@RG     ID:TAAGGCGA     BC:TAAGGCGA     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 1   PI:300  PL:ILLUMINA     PM:miseq        PU:TAAGGCGA     SM:trinidad
@RG     ID:TCCTGAGC     BC:TCCTGAGC     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 8   PI:300  PL:ILLUMINA     PM:miseq        PU:TCCTGAGC     SM:trinidad
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --config decode.json     VN:2.1.0

Estimating the priors

Suppose you have no idea what the priors are. You THINK you added the same amount of each pooled library, but who knows what actually happened. You also have no idea how much noise is in there…

You make an initial Pheniqs run with your best guess, or with the defaults, specify the --prior command line argument, and provide a path for a new config file. Pheniqs will use the run to estimate the relative proportions of barcodes and noise in your data and write a new config file that contains the estimated priors. Since you don't really want the actual output reads from this run, you can save some time by using another nifty Pheniqs feature: specify the output as /dev/null, basically telling Pheniqs you don't want the output. That will save quite a bit of time since Pheniqs won't waste time formatting the output or writing it anywhere.

pheniqs mux --config decode.json --prior decode_with_prior.json --output /dev/null 
When it's done running you will find a new config file at decode_with_prior.json.
{
    "CN": "CGSB",
    "DT": "2018-02-25T07:00:00+00:00",
    "PI": "300",
    "PL": "ILLUMINA",
    "PM": "miseq",
    "SM": "trinidad",
    "input": [
        "BDGGG.cram",
        "BDGGG.cram",
        "BDGGG.cram"
    ],
    "output": [
        "/dev/null"
    ],
    "prior adjusted job url": "decode_with_prior.json?format=json",
    "sample": {
        "algorithm": "pamld",
        "codec": {
            "@AGGCAGAA": {
                "LB": "trinidad 5",
                "barcode": [
                    "AGGCAGAA"
                ],
                "concentration": 0.186575935178905
            },
            "@CGTACTAG": {
                "LB": "trinidad 4",
                "barcode": [
                    "CGTACTAG"
                ],
                "concentration": 0.198520089969607
            },
            "@GGACTCCT": {
                "LB": "trinidad 9",
                "barcode": [
                    "GGACTCCT"
                ],
                "concentration": 0.211699846980038
            },
            "@TAAGGCGA": {
                "LB": "trinidad 1",
                "barcode": [
                    "TAAGGCGA"
                ],
                "concentration": 0.219937195111557
            },
            "@TCCTGAGC": {
                "LB": "trinidad 8",
                "barcode": [
                    "TCCTGAGC"
                ],
                "concentration": 0.168041901882987
            }
        },
        "confidence threshold": 0.99,
        "noise": 0.015225030876904,
        "transform": {
            "token": [
                "1::8"
            ]
        }
    },
    "template": {
        "transform": {
            "token": [
                "0::",
                "2::"
            ]
        }
    }
}
Notice the concentration and noise parameters have been adjusted to reflect what Pheniqs estimated from the initial run. The actual methodology for this prior estimate is described in the paper and in the manual. Notice the method is recursive, since it uses your initial values as the base for the estimation. You can potentially run it a second time and get an even more accurate estimate, but you also run the risk of overtraining your priors; that is a different discussion. Our benchmarks on synthetic data showed that a single estimation starting from a uniform distribution, on data that was VERY non-uniform, still yielded a significant improvement in both precision and recall, and was just shy of using the real (and obviously unknown when your data is real and not simulated) priors.

One small detail: since you executed Pheniqs with a null output, your new config also specifies a null output. You can either edit the config or override it on the command line when you use the new config.

pheniqs mux --config decode_with_prior.json -o -|less
@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined CN:CGSB DT:2018-02-25T07:00:00+00:00    PI:300  PL:ILLUMINA     PM:miseq        PU:undetermined SM:trinidad
@RG     ID:AGGCAGAA     BC:AGGCAGAA     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 5   PI:300  PL:ILLUMINA     PM:miseq        PU:AGGCAGAA     SM:trinidad
@RG     ID:CGTACTAG     BC:CGTACTAG     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 4   PI:300  PL:ILLUMINA     PM:miseq        PU:CGTACTAG     SM:trinidad
@RG     ID:GGACTCCT     BC:GGACTCCT     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 9   PI:300  PL:ILLUMINA     PM:miseq        PU:GGACTCCT     SM:trinidad
@RG     ID:TAAGGCGA     BC:TAAGGCGA     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 1   PI:300  PL:ILLUMINA     PM:miseq        PU:TAAGGCGA     SM:trinidad
@RG     ID:TCCTGAGC     BC:TCCTGAGC     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 8   PI:300  PL:ILLUMINA     PM:miseq        PU:TCCTGAGC     SM:trinidad
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --config decode_with_prior.json -o -     VN:2.1.0-6-gb006bb7a4990e86e62e4f0085e405662e88daef6
M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     RG:Z:GGACTCCT   BC:Z:GGACTCCT   QT:Z:B@CCCFC<   XB:f:1.10298e-06
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     RG:Z:GGACTCCT   BC:Z:GGACTCCT   QT:Z:B@CCCFC<   XB:f:1.10298e-06

Splitting output

Interleaved files have many advantages: a single input and a single output allow you to use standard POSIX streams and avoid storing intermediate files. They are faster to read and write, they keep experimental directory structures simpler and they make archives easier to browse. But sometimes you need to split your reads over multiple files, either by read segment or by barcode. Pheniqs can do that.

In the same fashion that the input directive allowed you to pull read segments from multiple files, the output directive allows you to split the segments over multiple files.

Read segment Splitting

If your output read has multiple segments you can either specify just one output, in which case all segments will be interleaved into that one output, or you can specify as many outputs as you have output segments and the segments will be written to the separate files.

either on the command line:
pheniqs mux --config decode.json --prior decode_with_prior.json --output first_segment.fastq.gz -o second_segment.bam 
or in the config file:
{
    "output": [
        "first_segment.fastq.gz",
        "second_segment.bam"
    ]
}
Each output file is evaluated independently so they don’t have to have the same format!

Barcode Splitting

You can also split reads into multiple files by barcode: either sample, cellular or molecular barcodes, as long as you only split by one decoder. To do that you specify the output directive in the configuration of each barcode.
{
    "CN": "CGSB",
    "DT": "2018-02-25T07:00:00+00:00",
    "PI": "300",
    "PL": "ILLUMINA",
    "PM": "miseq",
    "SM": "trinidad",
    "input": [
        "BDGGG.cram",
        "BDGGG.cram",
        "BDGGG.cram"
    ],
    "output": [
        "BDGGG_default.bam"
    ],
    "prior adjusted job url": "decode_with_prior.json?format=json",
    "sample": {
        "algorithm": "pamld",
        "codec": {
            "@AGGCAGAA": {
                "LB": "trinidad 5",
                "barcode": [
                    "AGGCAGAA"
                ],
                "concentration": 0.186575935178905,
                "output": [
                    "BDGGG_AGGCAGAA.bam"
                ]
            },
            "@CGTACTAG": {
                "LB": "trinidad 4",
                "barcode": [
                    "CGTACTAG"
                ],
                "concentration": 0.198520089969607,
                "output": [
                    "BDGGG_CGTACTAG_s01.fastq.gz",
                    "BDGGG_CGTACTAG_s02.fastq"
                ]
            },
            "@GGACTCCT": {
                "LB": "trinidad 9",
                "barcode": [
                    "GGACTCCT"
                ],
                "concentration": 0.211699846980038
            },
            "@TAAGGCGA": {
                "LB": "trinidad 1",
                "barcode": [
                    "TAAGGCGA"
                ],
                "concentration": 0.219937195111557
            },
            "@TCCTGAGC": {
                "LB": "trinidad 8",
                "barcode": [
                    "TCCTGAGC"
                ],
                "concentration": 0.168041901882987
            }
        },
        "confidence threshold": 0.99,
        "noise": 0.015225030876904,
        "transform": {
            "token": [
                "1::8"
            ]
        }
    },
    "template": {
        "transform": {
            "token": [
                "0::",
                "2::"
            ]
        }
    }
}
You can still set a global output, and data from any barcode that has not declared its own output directive will be routed to the global output. You can have a different layout for each barcode. When we said Pheniqs is flexible, we meant it 😉

Building R Packages

Hands On

Load Required Packages
# devtools automates some of the setup for a new package
library(devtools)

# roxygen2 handles building the documentation
library(roxygen2)
Define Some Variables
# Define package directory
package_wd <- "/home/mk5636/badas-r-package"

# Define package name as string
package_name <- 'badas'
Create A Package Skeleton Using create()
dir.create(file.path(package_wd))
setwd(file.path(package_wd))

# The 'create' function from devtools creates the
# directory structures and skeleton help files
# for a new package
create(package_name)
Examine the directory structure and DESCRIPTION file that were created by the create() function.
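
One hedged way to do that without leaving R (paths are built from the variables defined above):

# List the skeleton files that create() laid down
list.files(file.path(package_wd, package_name), recursive = TRUE)

# Peek at the DESCRIPTION file
readLines(file.path(package_wd, package_name, "DESCRIPTION"))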

Create a Function

Create an R function with a roxygen2-style header (for documentation). This will be incorporated into the package.
#' custom_add
#'
#' A custom function to add two numbers together
#'
#' @name custom_add
#' @param x The first number.
#' @param y The second number.
#' @return The result of adding the two numbers.
#' @export
#' @examples
#' result <- custom_add(1,2)
custom_add <- function(x,y){
    return(x+y)
}
Let’s test it out
custom_add(1, 2)
## [1] 3
Copy and paste the function into a file called {function_name}.R in {package_wd}/{package_name}/R/
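
If you prefer to do this from within R, one possible sketch is to write the definition out with dump(); note that dump() only writes the function definition, so the roxygen2 header above still needs to be pasted into the file by hand:

# Write custom_add() to the package's R/ directory
dump("custom_add", file = file.path(package_wd, package_name, "R", "custom_add.R"))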

Add data to the package

# Name the data
data_name <- 'mydata'

# Create the data object
data_obj <- c(1,2)

# Assign the data name to the data object
assign(x=data_name, value=data_obj)

# Create the data folder
data_path <- paste(package_wd, "/", package_name, "/data/" ,sep="")
dir.create(file.path(data_path))

# Write .rda file to data folder
data_file <- paste(data_path, data_name, ".rda", sep="")
save(list=data_name, file=data_file)
Create the documentation for the data
#' mydata
#'
#' A vector containing some test data.
#'
#' @format A vector with 2 values: c(1,2)
#' @source \url{http://www.flybase.org}
Copy and paste the above into a file called {data_name}.R in {package_wd}/{package_name}/R/

Document, Build, and Install the Package

setwd(file.path(package_wd))

# The 'document' function from devtools is a wrapper for the
# roxygen2::roxygenize() function from the roxygen2 package
# It builds all documentation for a package
document(package_name)
# The 'build' function from devtools converts a
# package source directory into a single bundled file. 
# Returns the path to the built package
package_tar <- build(package_name)
# Let's install the package now
install.packages(package_tar, repos=NULL, type='source')

Quit your R session, start a new session, and try the following commands

This shouldn’t work
custom_add(1,2)
Error in custom_add(1, 2) : could not find function "custom_add"
This should work:
library(badas)
custom_add(1,2)
## [1] 3
See how the roxygen2 header we created for the function translates into documentation
?custom_add

Congratulations, you have successfully built an R package! Now you can share the tarball with collaborators to allow them to use your package. Next we will learn about how to submit packages to CRAN and how to install packages from GitHub.

Submitting to CRAN

Submitting to CRAN used to be very hard: creating documentation by hand, manually checking everything, etc. But with tools like devtools, roxygen2, etc. it has become much easier.

1) It might be a good idea to check that the package name you want to use is not already taken. You can search the CRAN package list, or you can use the available package.

library(available)
available("available", browse = FALSE)
You can also use suggest to help you come up with a name, for example:
suggest("Convert Yeast IDs")
## [1] convertr
You probably want to do this before you build your package.

2) Run the check_built() function to ensure:

  • package can be built, installed, and uninstalled without issues
  • no potential conflicts, plays nice with other packages
  • required documentation present
  • correct formatting
  • no warnings
  • no notes
check_built(package_tar)
If there are notes, you need to explain them in the comments section when you submit your package.

3) Once you’re ready, submission is done through the web:
https://cran.r-project.org/submit.html

Upload the tarball, which must have been built using the build() function as we did, along with your name and email address, and, optionally, comments (this is where you would explain any notes from the check_built() function, for example).

Please note: You should read the policies and ensure your package is compliant prior to submitting:
https://cran.r-project.org/web/packages/policies.html

Installing Packages from GitHub

library(devtools)
install_github("GreshamLab/labtools")

Intro to Nextflow

Hands On

The figure below depicts the simple pipeline we will build. Note the dependencies (indicated with arrows) and the steps which run in parallel (same row).
Launch interactive session
srun -c2 -t1:00:00 --mem=10000 --pty /bin/bash

Load nextflow
module load nextflow/20.01.0

Make the config file
params.reads = "/scratch/mk5636/nextflow-badas/data/*.fastq.gz"
params.ref = "/scratch/work/cgsb/genomes/Public/Fungi/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa"
params.out = "$SCRATCH/nextflow-badas/out"
workDir = params.out + '/nextflow_work_dir'


Building the workflow (main.nf)
// Setup the reference file
 ref = file(params.ref)

// Custom filenames
Channel
     .fromFilePairs( params.reads, size: -1)
     { file -> file.getBaseName() - ~/_n0[12]/ - ~/.fastq/ }
     .ifEmpty { error "No reads matching: ${params.reads}"  }
     .set { read_pairs_ch }

// Standard Illumina filenames
//Channel
//     .fromFilePairs( params.reads )
//     .ifEmpty { error "No reads matching: ${params.reads}"  }
//     .set { read_pairs_ch }

process align {
    publishDir "${params.out}/aligned_reads", mode:'copy'

    input:
    set pair_id, file(reads) from read_pairs_ch

    output:
    set val(pair_id), file("${pair_id}_aligned_reads.sam") into aligned_reads_ch

    script:
    """
    module load bwa/intel/0.7.17
    bwa mem \
        $ref \
        ${reads[0]} \
        ${reads[1]} \
        > ${pair_id}_aligned_reads.sam
    """
}

process sort_index_reads {
}

process aligned_read_stats {
}
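
Once the remaining processes are filled in, you can test the workflow from the interactive session; the run command is the same one the sbatch script below uses (a sketch; substitute whatever name you gave the parameter file above, e.g. nextflow.config):

nextflow run main.nf -c nextflow.config
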
Using Slurm (add this to your config file)
process {
    executor = 'slurm'
    cpus = 10
    memory = '30 GB'
    time = '30 min'
    withName: align { cpus = 20 }
}

Using sbatch to submit your workflow
 #!/bin/bash 

 #SBATCH --nodes=1
 #SBATCH --ntasks-per-node=1
 #SBATCH --cpus-per-task=2
 #SBATCH --time=5:00:00
 #SBATCH --mem=2GB
 #SBATCH --job-name=myJob
 #SBATCH --output=slurm_%j.out

 module purge
 module load jdk/1.8.0_111
 module load nextflow/20.01.0

 nextflow run main.nf -c your.config 

Intro to ggplot

Load packages

require(ggplot2)

How ggplot works

The basic structure of a plot

Let’s load in some data that we can use for plotting: the iris dataset, a built-in dataset in R that contains petal and sepal dimensions of various individuals from three different iris species.
print(iris)
Let’s use base R to plot the Sepal Length vs Sepal Width for all the data
plot(iris$Sepal.Length, iris$Sepal.Width)
The same plot in ggplot is a bit more complicated to put together:
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
The above contains the components that are the bare minimum of what we need for a ggplot plot; we can add more on later, but let’s dissect the parts of this command:
ggplot(data = <DATA>, mapping = aes(<Mapping>)) +
        <GEOM_FUNCTION>()
  • arguments:
    • data: the dataframe you want to plot
    • mapping: Any variables from your data that affect plot output, listed in aes( )
  • commands:
    • ggplot( ): required start of every ggplot command. Contains any options that we want to apply to the whole plot (which can be nothing)
    • geom_{something}( ): how you’re plotting the data. Here, we want to plot points, so we’re using geom_point; there are tons of different geoms available, one for each type of plot you might want to make.
Arguments like data and mapping can go in the parentheses after the geom, producing the same plot as above:
ggplot() +
  geom_point(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width))
But there are specific situations in which it’s better to do this (we’ll see them later)

Modifying geom properties

We can also pass additional arguments to the geom: useful ones to know are:
  • color: line color; for the default shape used in geom_point, this actually colors the inside of the shape as well
  • fill: the fill color inside a shape
  • size: point size or line thickness
  • shape: for points, this is the point shape (for lines, the analogous property is linetype, which controls the dash pattern)
  • alpha: transparency level, with 0 being totally transparent and 1 being a solid, opaque color
For example:
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = 'blue', fill = 'yellow', shape = 23, alpha = 0.33, size = 5)
Why are some of these rhombuses darker than others? Note that any arguments that universally affect the properties of the points, lines, etc that we’re plotting, like the ones we used above, must be passed to the relevant geom, not to the ggplot( ) command. This is because the geom is in charge of making the points!

Mapping lots of variables

The plot we made above isn’t really all that useful. It’s great to see the data across all three species on one plot, but if we’re looking at this data, we’re probably actually interested in how these species differ from each other. So how do we make ggplot visually separate the points by species? Remember that the mapping argument deals with any properties of the plot that depend on variables in the supplied data frame. So we can modify our original code like this:
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(alpha = 0.33)
# Can also be written as:
ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species), alpha = 0.33)
Notice that the plot above uses both a variable-dependent color (based on the iris dataframe's Species column), which goes inside aes( ), and a variable-independent alpha value that applies to the whole geom_point command and goes outside aes( ). Also, notice that you got a legend for free! You didn't have to tell ggplot how to make it, or what info to include in it; it knows automatically based on how you set up your mapping.

Depending on context, you can make color, fill, shape, size or alpha variable-dependent. Some of these (color, fill, shape) obviously make more sense for categorical variables, while others (alpha, size) make more sense for continuous variables, but ggplot will only rarely stop you from making aesthetically and data-representationally questionable choices here.

Let's try an exercise: based on the code above, make a plot where Sepal.Length is on the x axis, Sepal.Width is on the y axis, all the points are colored red, the shape of the point depends on the Species, and the point size depends on Petal.Width. Questionable usefulness, but hey, it's possible and pretty easy…

Stacking multiple geoms

One of the places where ggplot really shines is when you want to combine multiple data representations on one plot. For example, I really like topology-style contour plots, which ggplot can make with geom_density2d. Once we know how to make a basic plot, combining a contour plot with a plot of the individual data points is super easy in ggplot:
# note, the first two lines are just our plot from above
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_density2d() +
  geom_point(alpha = 0.33)
Notice that the alpha argument we provided only applies to geom_point, so the contour lines don't show any transparency. However, any arguments provided to mapping in an aes( ) statement in the ggplot( ) command apply across all geoms. (Also, notice that when we add a geom, ggplot automatically updates our legend!)

One really powerful application of this is that we can actually make each geom( ) represent a different aspect of the same data. Let's say we'd like our datapoints to be colored by species, but we'd also like to see a contour plot of sepal length vs width across all the species. To do this, we're going to have to move our mapping calls inside the geoms, since we now want each geom to map the data differently:
# got rid of alpha here just to simplify things
ggplot(data = iris) +
  geom_density2d(aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, color = Species))
This plot shows that mapping actually controls not just where to plot the data points and how they should look aesthetically, but also how the data is grouped when it’s represented in the plot. Notice that in the first contour plot, the statistics needed to plot the contours were computed separately for each species. However, when we removed species from the aes( ) being used by geom_density2d, the data was no longer separated by species for any of the stats calculated for this geom, and they’re instead calculated across all the points in the dataset.

Aside: ggplot objects

ggplot actually creates objects that we can store as variables and add onto. So, for example, we can do this:
basic_iris_plot <-
  ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()
print(basic_iris_plot)
# let's add another geom to this plot
iris_plot_with_contours <-
  basic_iris_plot + geom_density2d()
print(iris_plot_with_contours)

themes and other options we can change

ggplot also allows a huge amount of control over other aspects of the plot (e.g. titles, axis labeling and scale, overall plot look, etc). For most of these, ggplot actually allows multiple equivalent ways to achieve the same effect.

axes + titles

Adding a title to a plot can be achieved using ggtitle()
basic_iris_plot +
  ggtitle('Iris Sepals')
We can also modify the axis properties directly
basic_iris_plot +
  ggtitle('Iris Sepals') +
  scale_x_continuous(name = 'Sepal Length',
                     limits = c(0,10)) +
  scale_y_log10(name = 'Sepal Width',
                breaks = c(2,3,4))
There’s a few things going on here:
  • scale_x_continuous( ) and scale_y_log10( ): set the scale of the x and y axes. For continuous variables, they can also be plotted on a square root scale, reversed, and various other transformations. For discrete variables, use scale_x_discrete( ) and scale_y_discrete( )
  • name: the axis label
  • limits: the bounds on the axis, must be provided as a 2-number vector
  • breaks: manually assign where the tickmarks go
  • labels: for discrete variables, this can be used to rename the categories along your axis

legend

You can modify the legend in a similar way to the other mappings (e.g. the axes); for example, if we want to modify the way the thing mapped to ‘color’ on our plot is represented, we can use scale_color_discrete( ), or, if we want to manually change the values assigned to each category (e.g. the colors), scale_color_manual( ):
basic_iris_plot +
  scale_color_manual(values=c("violet", "blue", "gray"),
                     name="Iris Species",
                     labels=c("Bristle-Pointed Iris", "Blue Flag", "Virginia Iris"))
We can also change the position of the legend using theme( ) (which can actually control nearly every other aesthetic aspect of the plot, such as font size, which axes get labels/tickmarks, etc).
basic_iris_plot +
  scale_color_manual(values=c("violet", "blue", "gray"),
                     name="Iris Species",
                     labels=c("Bristle-Pointed Iris", "Blue Flag", "Virginia Iris")) +
  theme(legend.position = 'bottom')

themes

Finally, the overall appearance of the graph can be changed by selecting a custom ‘theme’; this is a bit confusing, since these are distinct from the theme( ) command used above.
basic_iris_plot +
  scale_color_manual(values=c("violet", "blue", "gray"),
                     name="Iris Species",
                     labels=c("Bristle-Pointed Iris", "Blue Flag", "Virginia Iris")) +
  theme(legend.position = 'bottom') +
  theme_bw()

The structure of data that ggplot can plot

As you’ve seen, ggplot provides users with the power to easily change the appearance of the plot, and the statistics calculated, based on any single column in the dataframe containing the data to be plotted. But this also results in some pretty rigid rules about how your data needs to be organized. Namely, data for ggplot should be in tidy format:
  • each variable must have its own column
  • each observation must have its own row (but what’s an observation?)
  • each value must have its own cell
Let’s take a look at what that means. Compare the iris dataframe we’ve been using to the iris3 data, which comes with R and contains the same data:
iris_3_df <- data.frame(iris3)
print(iris_3_df)
Notice that now, there is a single row containing data on plants from each of the three iris species. This is not a completely crazy thing to do: maybe our experiment consisted of 50 individual plots, each of which had a plant from every species, and we collected the data on a plot-by-plot level. But organizing the data in this way makes directly graphing it with ggplot a real pain, at least if we want to compare species with each other. We no longer have a ‘species’ column, or neat columns for other mappings we might be interested in (e.g. Sepal Width). Here’s an attempt to make do with what we have:
ggplot(data = iris_3_df) +
  geom_point(mapping = aes(x = Sepal.L..Setosa, y = Sepal.W..Setosa), color = 'red') +
  geom_point(mapping = aes(x = Sepal.L..Versicolor, y = Sepal.W..Versicolor), color = 'blue') +
  geom_point(mapping = aes(x = Sepal.L..Virginica, y = Sepal.W..Virginica), color = 'green') +
  scale_x_continuous(name = 'Sepal Length') +
  scale_y_continuous(name = 'Sepal Width')
Some things still work well automatically (e.g. ggplot scales axes for us), but it’s a lot more effort to do this, and if we wanted to have a legend on this plot (or had more than a few categories we were interested in), it would be a complete nightmare. When putting together data to plot, we need to think very carefully about what exactly constitutes a single ‘observation’, and what the ‘variables’ are that we want to use for mapping. The tidyr package (which, like ggplot2, is part of the tidyverse package) has some really great functions for re-organizing data that looks like iris3 into a ‘tidy’ dataframe, and if you find yourself facing data that isn’t organized the right way for your plot, I really suggest looking over David Gresham’s tidyverse tutorial.
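
As a hedged sketch of what that reshaping could look like (assuming tidyr >= 1.0, which provides pivot_longer; the column names follow the iris_3_df layout printed above, so the resulting measurement columns are named Sepal.L, Sepal.W, Petal.L and Petal.W rather than the full iris names):

library(tidyr)

# Reshape the wide iris3 data frame into tidy form: one row per plant,
# with a Species column and one column per measurement
tidy_iris <- pivot_longer(iris_3_df,
                          cols = everything(),
                          names_to = c(".value", "Species"),
                          names_pattern = "(.*)\\.\\.(.*)")
print(tidy_iris)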

Why ggplot

  • easy exploratory data analysis: separation of variable mapping from visual representation (e.g. geoms) makes it easy to try different ways of plotting data
  • automation of the boring stuff: generation of legends, axis bounds, etc is done very well automatically based on the data and variables you’re plotting (but you can have finer control of it if you’d like)
  • automated stats: lots of geoms that calculate and display statistics of your data. These statistics are automatically calculated based on your grouping of the data, how you specify your axes (e.g. linear vs log scale), etc.
    • geom_density and geom_density2d for density estimation
    • geom_smooth for trendlines with error ribbons
    • stat_summary for other ways to bin and summarize data
  • plots as objects: storing plots as objects allows them to be easily modified and combined into figures
  • incredibly helpful online examples: the tidyverse website contains an incredible online manual with explanations and clear examples for nearly everything you might want to do in ggplot: https://ggplot2.tidyverse.org/reference/

Using ggplot for paper figures

Because ggplot does a great job of separating the aesthetic properties of the plot from what is being plotted, we can create a theme that defines how our plots should look in, e.g., paper figures and apply it to all our plots after generating them. First, let's set a theme for our plots. Because we want our figures to look nice and consistent for the paper, there are a lot of options we can specify here.
# Start from theme_bw() and then override individual elements; if theme_bw()
# were added last it would overwrite the customizations below
final_figure_ggplot_theme <-
  theme_bw() +
  theme(plot.title=element_text(size=16,face='bold'),
        plot.margin=unit(c(12,12,12,12),'pt'),
        panel.background=element_rect(fill='white'),
        panel.grid.major=element_line(color='grey',size=0.3),
        axis.line = element_line(color="black", size = 0.5),
        legend.title=element_blank(),
        legend.justification=c(0,1),
        legend.key = element_rect(fill='white'),
        legend.key.height = unit(2,'line'),
        legend.text=element_text(size=12,face='bold'),
        axis.text.x=element_text(size=12,face='bold'),
        axis.text.y=element_text(size=12,face='bold'),
        axis.title.x=element_text(size=14,face='bold'),
        axis.title.y=element_text(size=14,face='bold',angle=90))
Next, let’s create some plots. We don’t have to worry about appearances here; let’s just make sure the data shows up the way we want it to.
# a figure with the iris data
sepal_points <-
  ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()

sepal_trends <-
  ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_smooth(method = 'lm')

petal_boxplot <-
  ggplot(data = iris, aes(x = Species, y = Petal.Width, color = Species)) +
  geom_boxplot() +
  geom_point(position = 'jitter')
We can print these figures to look at them:
print(sepal_points)
print(sepal_trends)
print(petal_boxplot)
Useful, but not that neat. Now, let’s apply our figure theme:
sepal_points_figure <- sepal_points + final_figure_ggplot_theme
print(sepal_points_figure)
sepal_trends_figure <- sepal_trends + final_figure_ggplot_theme
print(sepal_trends_figure)
petal_boxplot_figure <- petal_boxplot + final_figure_ggplot_theme
print(petal_boxplot_figure)
There are a few ways to save the figure, but this is probably the easiest:
ggsave(file = '~/Documents/sepal_points_figure.pdf',
       plot = sepal_points_figure, width = 6.5, height = 4,
       useDingbats=FALSE)
(If saving as a pdf, useDingbats=FALSE is a must and will prevent a ggplot disaster from unfolding. If you load the cowplot package described below, it replaces ggsave with its own function that does this by default) Because we specified our font sizes in final_figure_ggplot_theme, they will be consistent across plots, regardless of the size we decide to save them at.

An example from the literature

eLife reproducible article

Other cool things

combining multiple data frames on one plot

ggplot makes it super easy to combine multiple datasets on one plot, assuming they have the relevant variables (dataframe columns) in common. Let’s break up the iris dataframe to see how this works:
iris_nonvirginica <- subset(iris, Species != 'virginica')
iris_virginica_petals <- subset(iris, Species == 'virginica')[, c('Petal.Width', 'Species')]
print(iris_nonvirginica)
print(iris_virginica_petals)
We now have two dataframes containing data on different species, and one of them contains only a subset of the columns in the other (the petal widths and species). But if petal width and species are what we want to plot, this isn't a problem for ggplot:
ggplot() +
  geom_boxplot(data = iris_nonvirginica, aes(x = Species, y = Petal.Width, color = Species)) +
  geom_boxplot(data = iris_virginica_petals, aes(x = Species, y = Petal.Width, color = Species))

facets

Another great tool ggplot provides is faceting. This allows you to separate data into subplots based on a column (or multiple columns):
basic_iris_plot +
  facet_wrap( ~ Species)
Notice that the x-axes are consistent among these plots.

Lots of additional packages!

Because ggplot is so popular, there’s been a ton of additional packages written that build on top of it. Here are two examples.

gganimate

Add animations to plots
library(gganimate)
basic_iris_plot +
  transition_states(Species)

cowplot

Arrange your plots for publication figures
library(cowplot)
package ‘cowplot’ was built under R version 3.4.4
Attaching package: ‘cowplot’

The following object is masked from ‘package:ggplot2’:

    ggsave
plot_grid(sepal_points_figure, sepal_trends_figure, petal_boxplot_figure,
          labels = "AUTO")
library(cowplot)
top_row <- plot_grid(sepal_points_figure, labels = 'A')
bottom_row <- plot_grid(sepal_trends_figure, petal_boxplot_figure, labels = c('B', 'C'))
plot_grid(top_row, bottom_row, nrow = 2)