Pheniqs

What is Pheniqs and why do you need it?

Pheniqs is a fast and accurate high-throughput sequencing barcode classifier.
  • It can convert sequence reads to and from FASTQ, SAM, BAM and CRAM formats, decode barcodes and manipulate read layout in a single process.
  • It is very accurate, much more so than traditional minimum distance based methods that simply compare the number of mismatches between two sequences. More specifically, it has a lower False Discovery Rate and, at the same time, a much lower Miss Rate. In simple terms, it misclassifies fewer reads, introduces less noise, and recovers more reads.
[Figure: false discovery rate and miss rate, Pheniqs vs. minimum distance decoding]
This plot compares false discovery rates and miss rates between Pheniqs and traditional minimum distance decoding. The X axis shows increasing average error rates. In the error range relevant to most Illumina runs, between 1 in 10000 and 1 in 1000, Pheniqs has a lower false discovery rate than MDD while still maintaining a much lower miss rate.

  • It is very fast. A benchmark decoding 94 standard Illumina dual indexed barcodes in more than 11.5 billion reads completed in just over 3 hours on a CPU with 14 cores. For comparison, deML, another statistical classifier that is still much faster than Python implementations, took almost 80 hours to complete the same task.
  • It can handle complex barcode designs with a flexible syntax that can address any number of barcodes anywhere on the read. That means no need for pre- or post-processing.
[Figure: transform patterns]
This diagram shows how Pheniqs uses a syntax similar to Python array slicing to address sample, cellular and molecular barcodes on a read made of two biological segments and two index segments, and produce SAM output.

Installing with conda

Pheniqs can be installed in many ways. You can install it with conda, you can use the pheniqs-build-api to build your own portable binary or you can build it from source. You can have a look here for details.

But the good people at the HPC have created a module for us on Greene, so we will use that for this demo because it takes exactly two seconds to use 🙂
module load pheniqs/2.1.0
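If you want to double check what the module gives you, the usual module and help commands work (exactly what you see depends on the cluster setup):
module avail pheniqs
pheniqs --help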

Example files

In this tutorial we will be using a small toy example with 2500 reads. The reads are paired-end, single indexed, and were produced by a MiSeq instrument: two biological segments that are 51 nucleotides long and a single index segment that is 8 nucleotides long. You are probably not used to seeing Illumina output this way, since samples are often demultiplexed by the core facility, but the raw output from the sequencer includes each segment in a gzip compressed FASTQ file with full read quality data.

You can get the three gzip compressed FASTQ files here: BDGGG_s01.fastq.gz, BDGGG_s02.fastq.gz, BDGGG_s03.fastq.gz.

Format conversions and interleaving

  • Both input and output can be FASTQ or SAM/BAM/CRAM in either split or interleaved lacing.
  • Input format is automatically detected from file content, not from the extension, so files without an extension or even with a wrong extension are fine (though still not good practice).
  • Output format defaults to uncompressed SAM on stdout.
  • Output format is determined by the output file extension: fastq, fastq.gz (gzip compressed FASTQ), sam, bam or cram.
  • When output is -, or more explicitly /dev/stdout, you must specify the output format with a query parameter: -?format=fastq, -?format=fastq&compression=gz (gzip compressed FASTQ), -?format=sam (not usually necessary since this is the default), -?format=bam, -?format=cram.
  • You can also control compression parameters in a similar fashion, see the manual for details.
Let's see some basic commands with Pheniqs.

  • View a gzip compressed FASTQ file as SAM on stdout
pheniqs mux --input BDGGG_s01.fastq.gz|less
@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --input BDGGG_s01.fastq.gz       VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   76      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
Currently Pheniqs has only one subcommand, called mux, but there will be more soon. The only argument we use here is -i/--input. Most command line arguments have both a short form and a long form. The short form is easier to use when typing interactively, but the long form is better when documenting things because it makes the command easier to read. Short form arguments can be aggregated; for instance -s -i input.fastq is the same as -si input.fastq. This is all standard POSIX convention, which Pheniqs tries to adhere to as much as possible.

  • View a gzip compressed FASTQ file as FASTQ on stdout. Here you can see what you can expect to find in those 3 files.
pheniqs mux --input BDGGG_s01.fastq.gz --output "-?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
pheniqs mux --input BDGGG_s02.fastq.gz --output "-?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GGACTCCT
+
B@CCCFC<
pheniqs mux --input BDGGG_s03.fastq.gz --output "-?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
  • Interleave the 3 gzip compressed FASTQ files into one interleaved uncompressed FASTQ file on stdout
pheniqs mux --input BDGGG_s01.fastq.gz --input BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output "/dev/stdout?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 2:N:0:
GGACTCCT
+
B@CCCFC<
@M02455:162:000000000-BDGGG:1:1101:10000:10630 3:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
-i/--input can show up multiple times on the command line. Notice that the order matters…

pheniqs mux --input BDGGG_s01.fastq.gz --input BDGGG_s03.fastq.gz -i BDGGG_s02.fastq.gz --output "/dev/stdout?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 2:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 3:N:0:
GGACTCCT
+
B@CCCFC<
  • Interleave the 3 gzip compressed FASTQ files into one interleaved SAM output on stdout
pheniqs mux --input BDGGG_s01.fastq.gz --input BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output "/dev/stdout?format=sam"|less
@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --input BDGGG_s01.fastq.gz --input BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output /dev/stdout?format=sam      VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     FI:i:1  TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630   13      *       0       0       *       *       0       0       GGACTCCT        B@CCCFC<        FI:i:2  TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     FI:i:3  TC:i:3
  • OK, now let's interleave the 3 gzip compressed FASTQ files into one interleaved CRAM file
pheniqs mux -i BDGGG_s01.fastq.gz -i BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output BDGGG.cram

When done, Pheniqs will give you a JSON formatted report on stderr. The report provides statistics on both input and output, as well as separate statistics for all reads and for reads passing quality control (Pass Filter, or PF, reads). The documentation on the website describes what each variable means. You can also redirect the report to a file with -R/--report (an example follows the report below). Since no sample classification was requested, all reads were directed to the "unclassified" read group.
{
    "incoming": {
        "count": 2500,
        "pf count": 2430,
        "pf fraction": 0.972
    },
    "sample": {
        "classified count": 0,
        "classified fraction": 0.0,
        "classified pf fraction": 0.0,
        "count": 2500,
        "index": 0,
        "pf classified count": 0,
        "pf classified fraction": 0.0,
        "pf count": 2430,
        "pf fraction": 0.972,
        "unclassified": {
            "ID": "undetermined",
            "PU": "undetermined",
            "count": 2500,
            "index": 0,
            "pf count": 2430,
            "pf fraction": 0.972,
            "pf pooled fraction": 1.0,
            "pooled fraction": 1.0
        }
    }
}
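If you would rather have the report in a file than on stderr, use -R/--report (the report file name here is arbitrary):
pheniqs mux -i BDGGG_s01.fastq.gz -i BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output BDGGG.cram --report BDGGG_report.json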
To inspect your CRAM file you can, again, use Pheniqs. Essentially this is a format conversion from CRAM to SAM… CRAM is the state-of-the-art compressed version of BAM. It is basically a super compressed SAM file. If you want to archive your data this can save you a LOT of space.

pheniqs mux -i BDGGG.cram|less
@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux -i BDGGG.cram    VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   76      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
M02455:162:000000000-BDGGG:1:1101:10000:10630   76      *       0       0       *       *       0       0       GGACTCCT        B@CCCFC<
M02455:162:000000000-BDGGG:1:1101:10000:10630   76      *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
Or in an interleaved FASTQ

pheniqs mux -i BDGGG.cram -o "-?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GGACTCCT
+
B@CCCFC<
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
Oh wait! This is probably not what you expected?! Pheniqs thinks every record in your file is a new read, when in fact every three records are 3 segments of the same read. To give you absolute control, Pheniqs, by default, expects you to explicitly describe the interleaving of the input file. You can use the -s/--sense-input flag to ask Pheniqs to try to guess the interleaving of the file.

pheniqs mux -si BDGGG.cram|less
@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux -si BDGGG.cram   VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     FI:i:1  TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630   13      *       0       0       *       *       0       0       GGACTCCT        B@CCCFC<        FI:i:2  TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     FI:i:3  TC:i:3
Ah, this makes more sense… Here it is as interleaved FASTQ

pheniqs mux -si BDGGG.cram -o "-?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 2:N:0:
GGACTCCT
+
B@CCCFC<
@M02455:162:000000000-BDGGG:1:1101:10000:10630 3:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
Here are the sizes of the files in the different formats.
size    type
959K    fastq
177K    fastq.gz
1.1M    sam
177K    bam
113K    cram
You can see how valuable compression can be. BAM and CRAM not only provide better compression, they can also be read and written faster, and the differences become more pronounced with bigger files. Since Pheniqs can convert between the formats in memory, you can store data on disk in a CRAM file and use Pheniqs to feed FASTQ to tools that accept FASTQ from standard input.
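For example, something along these lines (the downstream tool here is just a placeholder for whatever FASTQ-consuming program you use):
pheniqs mux -si BDGGG.cram --output "-?format=fastq" | some_fastq_tool -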

Read manipulation

Pheniqs can also parse and manipulate the content of the read. It uses a syntax similar to Python array slicing to tokenize every read segment and assemble new segments from the tokens. The syntax for manipulating the read structure is too complex to express on the command line, so you will need a JSON config file.

Let’s start with a simple theoretical example. Suppose the first 8 nucleotides on the first segment are actually a cellular barcode and you want to separate them into a different segment. We create a configuration file that describes the layout:
{
  "template": {
    "transform": {
      "token": [ "0:8:", "1:0:", "0:0:8", "2:0:" ]
    }
  }
}
pheniqs mux --config manipulate.json -si BDGGG.cram |less
M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       TAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     FGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     FI:i:1  TC:i:4
M02455:162:000000000-BDGGG:1:1101:10000:10630   13      *       0       0       *       *       0       0       GGACTCCT        B@CCCFC<        FI:i:2  TC:i:4
M02455:162:000000000-BDGGG:1:1101:10000:10630   13      *       0       0       *       *       0       0       CTAAGAAA        CCCCCGGG        FI:i:3  TC:i:4
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     FI:i:4  TC:i:4
We used the CRAM file as input since it's faster to read and it's also easier to feed Pheniqs with an interleaved file.

Since Pheniqs follows the Python array slicing syntax, it uses a zero based coordinate system. That means counters start from zero, and a token is inclusive of its start coordinate and exclusive of its end coordinate. It might sound complicated but it actually makes the arithmetic very easy. For instance, the first 3 nucleotides of the first read segment are "0:0:3".

A token is made of 3 colon separated integers. The first is the mandatory zero based input segment index. The second is an inclusive start coordinate and defaults to 0 if omitted. The third is an exclusive end coordinate; if it is omitted, the token spans to the end of the segment. The start and end coordinates can take positive or negative values to access the segment from either the 5' (left) or 3' (right) end, mimicking the Python array slicing syntax. The two colons are always mandatory.
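A few illustrative patterns under this syntax (examples of the notation, not tokens taken from the config above):
"0:0:3"   the first 3 cycles of segment 0
"1::"     all of segment 1
"2::8"    the first 8 cycles of segment 2
"0:-8:"   the last 8 cycles of segment 0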

The template configuration element controls what the output read will look like. In this case we asked for 4 output segments. The first output segment is "0:8:", or cycles 8 to the end of input segment 0. Pheniqs can validate your config file and describe, in words, what the conversion from input to output will be. Here is the description given by --validate:

pheniqs mux --config manipulate.json -si BDGGG.cram --validate
Output transform

    Output segment cardinality                  4

    Token No.0
        Length        variable
        Pattern       0:8:
        Description   cycles 8 to end of input segment 0

    Token No.1
        Length        variable
        Pattern       1::
        Description   cycles 0 to end of input segment 1

    Token No.2
        Length        8
        Pattern       0::8
        Description   cycles 0 to 8 of input segment 0

    Token No.3
        Length        variable
        Pattern       2::
        Description   cycles 0 to end of input segment 2

    Assembly instruction
        Append token 0 of input segment 0 to output segment 0
        Append token 1 of input segment 1 to output segment 1
        Append token 2 of input segment 0 to output segment 2
        Append token 3 of input segment 2 to output segment 3

Classifying sample barcodes

Now assume that the 8 nucleotides in the second of the three segments in the input read are actually the sample barcode. That would be the standard Illumina protocol for a single indexed paired-end run. We want to classify the reads by that barcode. If sequencers never made mistakes and reads were perfect, this would be a very easy task, since you would just need to compare two 8 character strings. In reality sequencers make mistakes, between 1 in 1000 and 1 in 10000 on most Illumina sequencers.

Most tools that classify barcodes ignore the quality scores the sequencer provides and simply look for either a perfect string match or at most 1 mismatch, if the barcodes in your set are different enough from each other to tolerate the mismatch. This method is called minimum distance decoding, abbreviated MDD, and Pheniqs can do that if you insist: you just set the decoder property "algorithm": "mdd".

But Pheniqs can do something much smarter: it can compute the posterior probability of the correct barcode being decoded, given the list of possible barcodes (see the paper for all the mathematical nitty gritty), and make decisions based on that. It can even report that probability for every read in a standard SAM auxiliary field. Think of it as a quality score for decoding the barcode. Except it's not just a quality score, it's an actual probability! If you had two independent barcodes (for instance a sample and a cellular barcode), the probability that the entire classification is correct is simply the product of the two: for example, if the sample barcode was decoded correctly with probability 0.999 and the cellular barcode with probability 0.99, both are correct with probability 0.999 × 0.99 ≈ 0.989. We call this a Phred Adjusted Maximum Likelihood Decoder, abbreviated PAMLD.

PAMLD has far fewer false negatives than MDD. You can expect it to recover many of the reads that MDD misses; how many depends on how confident you want Pheniqs to be, which you specify with the confidence threshold parameter. PAMLD does another cool thing: it filters noise. It can tell when a read is more likely to be noise than a real read, for instance a PhiX read that was spiked into the mixture with your pooled libraries, or some human DNA that found its way into your sample (how did that get here?!). Benchmarks on simulated data tell us that it is VERY good at picking up noise.

Let’s look at a config file for decoding the sample barcode.
{
    "input": [ "BDGGG.cram", "BDGGG.cram", "BDGGG.cram" ],
    "template": {
        "transform": {
            "token": [
                "0::",
                "2::"
            ]
        }
    },
    "sample": {
        "algorithm": "pamld",
        "confidence threshold": 0.99,
        "noise": 0.01,
        "transform": {
            "token": [ "1::8" ]
        },
        "codec": {
            "@AGGCAGAA": {
                "LB": "trinidad 5",
                "barcode": [ "AGGCAGAA" ],
                "concentration": 1
            },
            "@CGTACTAG": {
                "LB": "trinidad 4",
                "barcode": [ "CGTACTAG" ],
                "concentration": 1
            },
            "@GGACTCCT": {
                "LB": "trinidad 9",
                "barcode": [ "GGACTCCT" ],
                "concentration": 1
            },
            "@TAAGGCGA": {
                "LB": "trinidad 1",
                "barcode": [ "TAAGGCGA" ],
                "concentration": 1
            },
            "@TCCTGAGC": {
                "LB": "trinidad 8",
                "barcode": [ "TCCTGAGC" ],
                "concentration": 1
            }
        }
    }
}
Since we don't want the sample barcode to be part of the output read, we skip that segment in the template transform; we only want the first and last segments in the output.

Next we declare a sample instruction in the config file. The transform in the sample instruction tells Pheniqs the barcode is on the first 8 nucleotides of the second input segment. The codec section is where you specify details about the barcodes you expect to find in your data. You naturally specify the barcode sequence, or multiple sequences if, for instance, you use dual indexing (a sketch appears after the error examples below); in fact you can have any number of segments in your barcodes, as long as their length and structure match what you specified in the transform. Pheniqs has good error handling: if your config file has an error it will tell you exactly what went wrong.

For instance if in the first barcode you only specified 7 nucleotides:
    "@AGGCAGAA": {
        "LB": "trinidad 5",
        "barcode": [ "AGGCAGA" ],
        "concentration": 1
    },
Configuration error : sample decoder : expected 8 but found 7 nucleotides in segment 0 of barcode @AGGCAGAA
or if you copy/pasted your barcodes wrong and ended up with the same sequence specified twice:
    "@AGGCAGAA": {
        "LB": "trinidad 5",
        "barcode": [ "CGTACTAG" ],
        "concentration": 1
    },
    "@CGTACTAG": {
        "LB": "trinidad 4",
        "barcode": [ "CGTACTAG" ],
        "concentration": 1
    },
Configuration error : sample decoder : duplicate barcode sequence CGTACTAG
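As promised, here is what a dual indexed design could look like. For a hypothetical run with four input segments (read 1, index 1, index 2, read 2), the sample transform lists one token per barcode segment and each codec entry lists the sequences in the same order (the second index sequence below is invented for illustration):
    "sample": {
        "algorithm": "pamld",
        "transform": {
            "token": [ "1::8", "2::8" ]
        },
        "codec": {
            "@AGGCAGAA-TCGCCTTA": {
                "LB": "trinidad 5",
                "barcode": [ "AGGCAGAA", "TCGCCTTA" ]
            }
        }
    }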
We specified the input in the config file instead of on the command line, so we don't have to keep typing it. We also specified how to parse the input explicitly, instead of using --sense-input, by listing the interleaved input file 3 times, telling Pheniqs to pull three records for every read, or in other words that it should expect to find 3 consecutive segments for every read in the file. Being able to explicitly specify how to parse the input allows you to deal with malformed files (let's face it, bioinformatics isn't perfect).

Notice we specified a 0.99 confidence threshold, so any barcode decoded with a posterior probability lower than that will be marked as failing quality control and moved to the unclassified (or undetermined) group. The confidence threshold defaults to 0.95, but it represents a tradeoff between false positives and false negatives, and the ideal value depends on your data, your barcode set and, frankly, what you want. We found 0.95 to be a fair tradeoff, but if you prefer to lose more reads and be super confident about the ones you keep you can set a higher value, and if you feel that your downstream analysis will be able to detect false positives and you really want to recover as many reads as possible you can set a lower value.

PAMLD's Bayesian model depends on a set of prior probabilities: basically how much noise (the noise variable) and how much of each sample (the concentration variable in each barcode) you expect to find in your data. Noise defaults to 0.01, or 1%, which is the amount of PhiX spiked into textbook Illumina runs; if you are doing anything other than a vanilla Illumina protocol you probably have more. If you leave out the concentration values they all default to 1, forming a uniform distribution. Pheniqs will normalize whatever you specify in concentration so that the values sum up to 1 - noise. For instance, if you specify 1 in all your barcodes, 0.01 in noise and you have 5 samples, each will be assumed to have a 0.99 / 5 = 0.198 prior; if you instead gave one barcode a concentration of 2 and the other four a concentration of 1, that barcode would get a 2/6 × 0.99 = 0.33 prior and the others 1/6 × 0.99 = 0.165 each. What we have in the above config file could just as well have been left out, since it is the default.

If you have a good guess you can specify your priors in the config. We will see later that Pheniqs can give you a pretty good estimate for them.

First, let's see how the output looks
pheniqs mux --config decode.json |less
@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@RG     ID:AGGCAGAA     BC:AGGCAGAA     LB:trinidad 5   PU:AGGCAGAA
@RG     ID:CGTACTAG     BC:CGTACTAG     LB:trinidad 4   PU:CGTACTAG
@RG     ID:GGACTCCT     BC:GGACTCCT     LB:trinidad 9   PU:GGACTCCT
@RG     ID:TAAGGCGA     BC:TAAGGCGA     LB:trinidad 1   PU:TAAGGCGA
@RG     ID:TCCTGAGC     BC:TCCTGAGC     LB:trinidad 8   PU:TCCTGAGC
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --config decode.json     VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     RG:Z:GGACTCCT   BC:Z:GGACTCCT   QT:Z:B@CCCFC<   XB:f:1.0616e-06
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     RG:Z:GGACTCCT   BC:Z:GGACTCCT   QT:Z:B@CCCFC<   XB:f:1.0616e-06
M02455:162:000000000-BDGGG:1:1101:10005:7907    589     *       0       0       *       *       0       0       CCGTTTGAATGTTGACGGGATGAACATAATAAGCAATGACGGCAGCAATAA     CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG     RG:Z:undetermined       BC:Z:ATGTTTAT   QT:Z:@----6,,
M02455:162:000000000-BDGGG:1:1101:10005:7907    653     *       0       0       *       *       0       0       GAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCG     CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG     RG:Z:undetermined       BC:Z:ATGTTTAT   QT:Z:@----6,,
Pheniqs has declared a read group for every sample barcode and one for the undetermined reads. The BC and QT tags hold the uncorrected sample barcode and its quality scores. RG is a reference to the Read Group ID as declared in the header. Those are all standard tags; you will find them in the SAM specification. XB is the special Pheniqs sauce: it is the probability that the sample barcode decoding was incorrect, the decoding error probability. In the second read (remember, this is an interleaved file: every two records are one read, and you can see they share the same read ID) decoding failed. BC and QT are there, but RG points to undetermined and there is no XB. You will also notice the flag field is different, since the bit marking the read as failing QC is set.

You can add Read Group metadata to your config file and it will end up where you expect it to. You have already seen an example of that with the LB tag. For instance, we can add some global tags so they are applied to all Read Groups.
{
    "CN": "CGSB",
    "DT": "2018-02-25T07:00:00+00:00",
    "PI": "300",
    "PL": "ILLUMINA",
    "PM": "miseq",
    "SM": "trinidad",
    "input": [ "BDGGG.cram", "BDGGG.cram", "BDGGG.cram" ],
    .
    .
    .
}
and we get a richer header
@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined CN:CGSB DT:2018-02-25T07:00:00+00:00    PI:300  PL:ILLUMINA     PM:miseq        PU:undetermined SM:trinidad
@RG     ID:AGGCAGAA     BC:AGGCAGAA     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 5   PI:300  PL:ILLUMINA     PM:miseq        PU:AGGCAGAA     SM:trinidad
@RG     ID:CGTACTAG     BC:CGTACTAG     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 4   PI:300  PL:ILLUMINA     PM:miseq        PU:CGTACTAG     SM:trinidad
@RG     ID:GGACTCCT     BC:GGACTCCT     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 9   PI:300  PL:ILLUMINA     PM:miseq        PU:GGACTCCT     SM:trinidad
@RG     ID:TAAGGCGA     BC:TAAGGCGA     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 1   PI:300  PL:ILLUMINA     PM:miseq        PU:TAAGGCGA     SM:trinidad
@RG     ID:TCCTGAGC     BC:TCCTGAGC     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 8   PI:300  PL:ILLUMINA     PM:miseq        PU:TCCTGAGC     SM:trinidad
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --config decode.json     VN:2.1.0

Estimating the priors

Suppose you have no idea what the priors are. You THINK you added the same amount of each pooled library, but who knows what actually happened. You also have no idea how much noise is in there…

You make an initial Pheniqs run with your best guess, or with the defaults, specifying the --prior command line argument with a path for a new config file. Pheniqs will use the run to estimate the relative proportions of barcodes and noise in your data and write a new config file that contains the estimated priors. Since you don't really want the actual output reads from this run, you can save some time by using another nifty Pheniqs feature: specify the output as /dev/null, which basically tells Pheniqs you don't want the output. That saves quite a bit of time, since Pheniqs won't waste time formatting the output or writing it anywhere.

pheniqs mux --config decode.json --prior decode_with_prior.json --output /dev/null 
When it's done running you will find a new config file at decode_with_prior.json.
{
    "CN": "CGSB",
    "DT": "2018-02-25T07:00:00+00:00",
    "PI": "300",
    "PL": "ILLUMINA",
    "PM": "miseq",
    "SM": "trinidad",
    "input": [
        "BDGGG.cram",
        "BDGGG.cram",
        "BDGGG.cram"
    ],
    "output": [
        "/dev/null"
    ],
    "prior adjusted job url": "decode_with_prior.json?format=json",
    "sample": {
        "algorithm": "pamld",
        "codec": {
            "@AGGCAGAA": {
                "LB": "trinidad 5",
                "barcode": [
                    "AGGCAGAA"
                ],
                "concentration": 0.186575935178905
            },
            "@CGTACTAG": {
                "LB": "trinidad 4",
                "barcode": [
                    "CGTACTAG"
                ],
                "concentration": 0.198520089969607
            },
            "@GGACTCCT": {
                "LB": "trinidad 9",
                "barcode": [
                    "GGACTCCT"
                ],
                "concentration": 0.211699846980038
            },
            "@TAAGGCGA": {
                "LB": "trinidad 1",
                "barcode": [
                    "TAAGGCGA"
                ],
                "concentration": 0.219937195111557
            },
            "@TCCTGAGC": {
                "LB": "trinidad 8",
                "barcode": [
                    "TCCTGAGC"
                ],
                "concentration": 0.168041901882987
            }
        },
        "confidence threshold": 0.99,
        "noise": 0.015225030876904,
        "transform": {
            "token": [
                "1::8"
            ]
        }
    },
    "template": {
        "transform": {
            "token": [
                "0::",
                "2::"
            ]
        }
    }
}
Notice the concentration and noise parameters have been adjusted to reflect what Pheniqs estimated from the initial run. The actual methodology for this prior estimation is described in the paper and in the manual. Notice that the method is recursive, since it uses your initial values as the base for the estimation: you can potentially run it a second time and get an even more accurate estimate, but you also run the risk of over-fitting your priors, which is a different discussion… Our benchmarks on synthetic data showed that a single estimation starting from a uniform distribution, on data that was VERY non-uniform, still yielded a significant improvement in both precision and recall, and was just shy of using the real (and obviously unknown, when your data is real and not simulated) priors.

One minor detail: since you executed Pheniqs with a null output, the new config also specifies a null output. You can either edit the config or override the output on the command line when you use it.
pheniqs mux --config decode_with_prior.json -o -|less
@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined CN:CGSB DT:2018-02-25T07:00:00+00:00    PI:300  PL:ILLUMINA     PM:miseq        PU:undetermined SM:trinidad
@RG     ID:AGGCAGAA     BC:AGGCAGAA     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 5   PI:300  PL:ILLUMINA     PM:miseq        PU:AGGCAGAA     SM:trinidad
@RG     ID:CGTACTAG     BC:CGTACTAG     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 4   PI:300  PL:ILLUMINA     PM:miseq        PU:CGTACTAG     SM:trinidad
@RG     ID:GGACTCCT     BC:GGACTCCT     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 9   PI:300  PL:ILLUMINA     PM:miseq        PU:GGACTCCT     SM:trinidad
@RG     ID:TAAGGCGA     BC:TAAGGCGA     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 1   PI:300  PL:ILLUMINA     PM:miseq        PU:TAAGGCGA     SM:trinidad
@RG     ID:TCCTGAGC     BC:TCCTGAGC     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 8   PI:300  PL:ILLUMINA     PM:miseq        PU:TCCTGAGC     SM:trinidad
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --config decode_with_prior.json -o -     VN:2.1.0-6-gb006bb7a4990e86e62e4f0085e405662e88daef6
M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     RG:Z:GGACTCCT   BC:Z:GGACTCCT   QT:Z:B@CCCFC<   XB:f:1.10298e-06
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     RG:Z:GGACTCCT   BC:Z:GGACTCCT   QT:Z:B@CCCFC<   XB:f:1.10298e-06

Splitting output

Interleaved files have many advantages: a single input and a single output let you use standard POSIX streams and avoid storing intermediate files. They are faster to read and write, they keep experimental directory structures simpler and they make archives easier to browse. But sometimes you need to split your reads over multiple files, either by read segment or by barcode. Pheniqs can do that.

In the same fashion that the input directive allows you to list multiple files to pull read segments from, the output directive allows you to split the segments over multiple files.

Read Segment Splitting

If your output read has multiple segments, you can either specify just one output, in which case all segments will be interleaved into that one output, or you can specify as many outputs as you have output segments and each segment will be written to its own file.

either on the command line
pheniqs mux --config decode.json --prior decode_with_prior.json --output first_segment.fastq.gz -o second_segment.bam 
or in the config file
{
    "output": [
        "first_segment.fastq.gz",
        "second_segment.bam"
    ]
}
Each output file is evaluated independently so they don’t have to have the same format!

Barcode Splitting

You can also split reads into multiple files by barcode: sample, cellular or molecular barcodes, as long as you only split by one decoder. To do that, you specify an output directive inside the codec entry for each barcode.
{
    "CN": "CGSB",
    "DT": "2018-02-25T07:00:00+00:00",
    "PI": "300",
    "PL": "ILLUMINA",
    "PM": "miseq",
    "SM": "trinidad",
    "input": [
        "BDGGG.cram",
        "BDGGG.cram",
        "BDGGG.cram"
    ],
    "output": [
        "BDGGG_default.bam"
    ],
    "prior adjusted job url": "decode_with_prior.json?format=json",
    "sample": {
        "algorithm": "pamld",
        "codec": {
            "@AGGCAGAA": {
                "LB": "trinidad 5",
                "barcode": [
                    "AGGCAGAA"
                ],
                "concentration": 0.186575935178905,
                "output": [
                    "BDGGG_AGGCAGAA.bam"
                ]
            },
            "@CGTACTAG": {
                "LB": "trinidad 4",
                "barcode": [
                    "CGTACTAG"
                ],
                "concentration": 0.198520089969607,
                "output": [
                    "BDGGG_CGTACTAG_s01.fastq.gz",
                    "BDGGG_CGTACTAG_s02.fastq"
                ]
            },
            "@GGACTCCT": {
                "LB": "trinidad 9",
                "barcode": [
                    "GGACTCCT"
                ],
                "concentration": 0.211699846980038
            },
            "@TAAGGCGA": {
                "LB": "trinidad 1",
                "barcode": [
                    "TAAGGCGA"
                ],
                "concentration": 0.219937195111557
            },
            "@TCCTGAGC": {
                "LB": "trinidad 8",
                "barcode": [
                    "TCCTGAGC"
                ],
                "concentration": 0.168041901882987
            }
        },
        "confidence threshold": 0.99,
        "noise": 0.015225030876904,
        "transform": {
            "token": [
                "1::8"
            ]
        }
    },
    "template": {
        "transform": {
            "token": [
                "0::",
                "2::"
            ]
        }
    }
}
You can still set a global output, and data from any barcode that has not declared its own output directive will be routed to the global output. You can have different layouts for each barcode. When we said Pheniqs is flexible, we meant it 😉

Building R Packages

Hands On

Load Required Packages
# devtools automates some of the setup for a new package
library(devtools)

# roxygen2 handles building the documentation
library(roxygen2)
Define Some Variables
# Define package directory
package_wd <- "/home/mk5636/badas-r-package"

# Define package name as string
package_name <- 'badas'
Create A Package Skeleton Using create()
dir.create(file.path(package_wd))
setwd(file.path(package_wd))

# The 'create' function from devtools creates the
# directory structures and skeleton help files
# for a new package
create(package_name)
Examine the directory structure and DESCRIPTION file that were created by the create() function
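One quick way to look at what create() generated, from within R (paths are built from the variables defined above):
# List everything in the new package directory
list.files(file.path(package_wd, package_name), recursive = TRUE)

# Print the generated DESCRIPTION file
writeLines(readLines(file.path(package_wd, package_name, "DESCRIPTION")))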

Create a Function

Create an R function with a roxygen2-style header (for documentation). This will be incorporated into the package.
#' custom_add
#'
#' A custom function to add two numbers together
#'
#' @name custom_add
#' @param x The first number.
#' @param y The second number.
#' @return The result of adding the two numbers.
#' @export
#' @examples
#' result <- custom_add(1,2)
custom_add <- function(x,y){
    return(x+y)
}
Let’s test it out
custom_add(1, 2)
## [1] 3
Copy and paste the function into a file called {function_name}.R in {package_wd}/{package_name}/R/

Add data to the package

# Name the data
data_name <- 'mydata'

# Create the data object
data_obj <- c(1,2)

# Assign the data name to the data object
assign(x=data_name, value=data_obj)

# Create the data folder
data_path <- paste(package_wd, "/", package_name, "/data/" ,sep="")
dir.create(file.path(data_path))

# Write .rda file to data folder
data_file <- paste(data_path, data_name, ".rda", sep="")
save(list=data_name, file=data_file)
Create the documentation for the data
#' mydata
#'
#' A vector containing some test data.
#'
#' @format A vector with 2 values: c(1,2)
#' @source \url{http://www.flybase.org}
Copy and paste the above into a file called {data_name}.R in {package_wd}/{package_name}/R/

Document, Build, and Install the Package

setwd(file.path(package_wd))

# The 'document' function from devtools is a wrapper for the
# roxygen2::roxygenize() function from the roxygen2 package
# It builds all documentation for a package
document(package_name)
# The 'build' function from devtools converts a
# package source directory into a single bundled file. 
# Returns the path to the built package
package_tar <- build(package_name)
# Let's install the package now
install.packages(package_tar, repos=NULL, type='source')

Quit your R session, start a new session, and try the following commands

This shouldn’t work
custom_add(1,2)
Error in custom_add(1, 2) : could not find function "custom_add"
This should work:
library(badas)
custom_add(1,2)
## [1] 3
See how the roxygen2 header we created for the function translates into documentation
?custom_add

Congratulations, you have successfully built an R package! Now you can share the tarball with collaborators to allow them to use your package. Next we will learn about how to submit packages to CRAN and how to install packages from GitHub.

Submitting to CRAN

Submitting to CRAN used to be very hard: creating documentation by hand, manually checking everything, etc. But with tools like devtools, roxygen2, etc. it has become much easier.

1) It might be a good idea to check that the package name you want to use is not already taken. You can search the CRAN package list, or you can use the available package.

library(available)
available("available", browse = FALSE)
You can also use suggest() to help you come up with a name, for example:
suggest("Convert Yeast IDs")
## [1] convertr
You probably want to do this before you build your package.

2) Run the check_built() function to ensure:

  • package can be built, installed, and uninstalled without issues
  • no potential conflicts, plays nice with other packages
  • required documentation present
  • correct formatting
  • no warnings
  • no notes
check_built(package_tar)
If there are notes, you need to explain them in the comments section when you submit your package.

3) Once you’re ready, submission is done through the web:
https://cran.r-project.org/submit.html

Upload the tarball, which must have been built using the build() function as we did, along with your name and email address, and optionally comments (this is where you would explain any notes from the check_built() function, for example).

Please note: You should read the policies and ensure your package is compliant prior to submitting:
https://cran.r-project.org/web/packages/policies.html

Installing Packages from GitHub

library(devtools)
install_github("GreshamLab/labtools")

JBrowse: Visualizing Data Quickly & Easily

Watch the September 24, 2020 BADAS here.

Introduction

During the 2020 Summer of COVID-19, the Ghedin and Gresham labs sequenced SARS-CoV-2 isolates. To visualize and share this data with collaborators, the web-based genome visualization software JBrowse was used: https://covid-19.bio.nyu.edu.
To benefit all researchers at NYU engaged in genomics research, a centralized JBrowse service has been published at http://jbrowse.bio.nyu.edu/ for PIs and their lab members.

Demo

We’ll go through a demo of 
  • Uploading a dataset
  • How to manipulate the dataset visualization
  • Some features
  • Sharing URLs
Log in to the HPC Prince cluster, where there is a small sample dataset to upload and manipulate. This dataset is located in /beegfs/eb167/yeast. You can see the standard files you are familiar with. There is one file specific to JBrowse: the samplelist text file. It is used to tell JBrowse how to group the files after upload, saving you the hassle of manually manipulating the configuration file. To upload, run the following, changing only netID to your NetID.
cgsb_upload2jbrowse -p demo -d netID -f /beegfs/eb167/yeast
You will be prompted for your password twice in this process, but once you request access to a PI lab, passwordless authentication will be configured for you. Let me explain what this command does and the options available.
USAGE: cgsb_upload2jbrowse -p PI -d DATASET [-f FOLDER] [-s SAMPLELIST] [FILES] 
-------------------------------------------------------
-p | --PI                specify PI 
-d | --dataset           specify data set 
-f | --folder            specify folder containing files 
-s | --samplelist        specify sample list for categorization 
-------------------------------------------------------
File formats supported: 
- fa 
- fasta 
- fna 
- vcf.gz* 
- bam* 
- bam.bw 
- cram*  
- gff3.gz* 
 *Requires index file (tbi, bai, crai) of the same base name 
Execution of this script will rsync (upload) the specified files or folders to the appropriate JBrowse folder, create basic configurations for your dataset, and publish the site. The required fields are -p/--PI for the PI whose lab you are a member of and -d/--dataset for the dataset name that will be visible on the public site. Optional arguments are -f/--folder to recursively upload a folder and its contents, as we just did, and -s/--samplelist if the sample list file is not named 'samplelist'. After the command completes successfully you'll be given a URL, something like: https://jbrowse.bio.nyu.edu/demo/?data=data/netID

Other Uploading Options:

Example 2 –

To transfer your data in your scratch space (/scratch/user/project1/data), along with the reference data in the Prince shared genome repository folders, to Smith's project1 data set, run the following.
cgsb_upload2jbrowse -p Smith -d project1 \
  -f /scratch/user/project1/data \
  /scratch/work/cgsb/genomes/Public/Fungi/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa \
  /scratch/work/cgsb/genomes/Public/Fungi/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Saccharomyces_cerevisiae.R64-1-1.34.gff3

Example 3 –

To transfer from outside the Prince cluster
# Transfer the files
rsync --progress -ruv /path/to/dataset/ \ 
     <NYUnetID>@jbrowse.bio.nyu.edu:/jbrowse/<PI>/<DATASET>
# Build and publish the tracks based on the files uploaded
ssh <NYUnetID>@jbrowse.bio.nyu.edu addTracks --PI <PI> --dataset <DATASET>
The data will be accessible immediately on the JBrowse server. Choose your PI in the dropdown menu on the JBrowse homepage, then the data set name that was specified in the previous step. Once accessed, you will be able to display visualizations, or tracks, for each file. These tracks are by default named after the file itself.
Let’s go to the URL of our uploaded yeast dataset. 
On the left we see two blocks labeled combined_ntr_21 and combined_ntr_22. This again was possible because we had a samplelist file listing the prefix used for grouping. If one is not provided it will look like this: https://jbrowse.bio.nyu.edu/demo/?data=data/badas-noList.
The available tracks will be selectable on the left allowing you to display only items of interest and their order displayed. 
If you go to the Track menu at the top of the page, you have two options: create a combination track, which combines two tracks, or a sequence search track, which shows regions of the reference sequence or its translations that match a DNA or amino acid sequence.
We can search for features of interest with the search bar at the top.
To the right of the search box we have the highlight button. We can highlight areas of note as well. This is beneficial as the URL will dynamically change based on the current view which includes the highlights. 
You can then use this unique URL to share with colleagues or post in publications. Again this site is open to the public without need for VPN.

Customizing Tracks

We have configured the site to ingest your data in the JBrowse default fashion. You can edit the tracks configuration file (tracks.conf) to meet your needs. For more information on what is possible visit https://jbrowse.org/docs/installation.html

Next Steps

Requesting Permissions

To get access, submit a request through the biology department forms portal at https://forms.bio.nyu.edu. This is separate from HPC access and will require approval from the PI, primarily because you will have the ability to alter or delete all datasets associated with that PI. Note that the data itself is not backed up; only the configuration files are backed up nightly.

GATK Pipeline

We have integrated automated JBrowse visualization into the GATK pipeline, which performs alignment and variant calling. Within your nextflow.config file add the following lines to specify the data set name and the PI.
// JBrowse params
params.do_jbrowse = true
params.gff = "/scratch/work/cgsb/genomes/Public/Fungi/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Saccharomyces_cerevisiae.R64-1-1.34.gff3"
params.jbrowse_pi = "Smith"
params.dataset_name = "project1"

Intro to ggplot

Load packages

require(ggplot2)

How ggplot works

The basic structure of a plot

Let’s load in some data that we can use for plotting: the iris dataset, a built-in dataset in R that contains petal and sepal dimensions of various individuals from three different iris species.
print(iris)
Let’s use base R to plot the Sepal Length vs Sepal Width for all the data
plot(iris$Sepal.Length, iris$Sepal.Width)
The same plot in ggplot is a bit more complicated to put together:
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
The above contains the components that are the bare minimum of what we need for a ggplot plot; we can add more on later, but let’s dissect the parts of this command:
ggplot(data = <DATA>, mapping = aes(<Mapping>)) +
        <GEOM_FUNCTION>()
  • arguments:
    • data: the dataframe you want to plot
    • mapping: Any variables from your data that affect plot output, listed in aes( )
  • commands:
    • ggplot( ): required start of every ggplot command. Contains any options that we want to apply to the whole plot (which can be nothing)
    • geom_{something}( ): how you’re plotting the data. Here, we want to plot points, so we’re using geom_point; there are tons of different geoms available, one for each type of plot you might want to make.
Arguments like data and mapping can go in the parentheses after the geom, producing the same plot as above:
ggplot() +
  geom_point(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width))
But there are specific situations in which it’s better to do this (we’ll see them later)

Modifying geom properties

We can also pass additional arguments to the geom: useful ones to know are:
  • color: line color; for the default shape used in geom_point, this actually colors the inside of the shape as well
  • fill: the fill color inside a shape
  • size: point size or line thickness
  • shape: for points, this is the shape; for lines, this is the line pattern or dashyness
  • alpha: transparency level, with 0 being totally transparent and 1 being a solid, opaque color
For example:
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = 'blue', fill = 'yellow', shape = 23, alpha = 0.33, size = 5)
Why are some of these rhombuses darker than others? Because alpha is 0.33, overlapping semi-transparent points stack up and appear darker. Note that any arguments that universally affect the properties of the points, lines, etc. that we're plotting, like the ones we used above, must be passed to the relevant geom, not to the ggplot( ) command. This is because the geom is in charge of making the points!

Mapping lots of variables

The plot we made above isn’t really all that useful. It’s great to see the data across all three species on one plot, but if we’re looking at this data, we’re probably actually interested in how these species differ from each other. So how do we make ggplot visually separate the points by species? Remember that the mapping argument deals with any properties of the plot that depend on variables in the supplied data frame. So we can modify our original code like this:
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(alpha = 0.33)
# Can also be written as:
ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species), alpha = 0.33)
Notice that the plot above uses both a variable-dependent color (based on the iris dataframe's Species column), which goes inside aes( ), and a variable-independent alpha value that applies to the whole geom_point command and goes outside aes( ). Also, notice that you got a legend for free! You didn't have to tell ggplot how to make it, or what info to include in it; it knows automatically based on how you set up your mapping.

Depending on context, you can make color, fill, shape, size or alpha variable-dependent. Some of these (color, fill, shape) obviously make more sense for categorical variables, while others (alpha, size) make more sense for continuous variables, but ggplot will only rarely stop you from making aesthetically and data-representationally questionable choices here.

Let's try an exercise: based on the code above, make a plot where Sepal.Length is on the x axis, Sepal.Width is on the y axis, all the points are colored red, the shape of the point depends on the Species, and the point size depends on Petal.Width. Questionable usefulness, but hey, it's possible and pretty easy… A possible solution is sketched below.
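Here is one way to do it: the variable-dependent aesthetics (shape, size) go inside aes( ), while the fixed red color is passed directly to the geom.
ggplot(data = iris,
       mapping = aes(x = Sepal.Length, y = Sepal.Width, shape = Species, size = Petal.Width)) +
  geom_point(color = 'red')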

Stacking multiple geoms

One of the places where ggplot really shines is when you want to combine multiple data representations on one plot. For example, I really like topology-style contour plots, which ggplot can make with geom_density2d. Once we know how to make a basic plot, and combining a contour plot with a plot the individual data points is super easy in ggplot:
# note, the first two lines are just our plot from above
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_density2d() +
  geom_point(alpha = 0.33)
Notice that the alpha argument we provided only applies to geom_point, so the contour lines don’t show any transparency. However, any arguments provided to mapping in an aes( ) statement in the ggplot( ) command apply across all geoms. (Also, notice that when we add a geom, ggplot automatically updates our legend!) One really powerful application of this is that we can actually make each geom( ) represent a different aspect of the same data. Let’s say we’d like our datapoints to be colored by species, but we’d also like to see a contour plot of sepal length vs width across all the species. To do this, we’re going to have to move our mapping calls inside the geoms, since we now want each geom to map the data differently:
# got rid of alpha here just to simplify things
ggplot(data = iris) +
  geom_density2d(aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, color = Species))