Pheniqs

What is Pheniqs and why do you need it?

Pheniqs is a fast and accurate high-throughput sequencing barcode classifier.

It can convert sequence reads to and from FASTQ, SAM, BAM and CRAM formats, decode barcodes and manipulate read layout in a single process.
It is very accurate, much more than traditional minimum distance based methods that simply compare the number of mismatches between two sequences. More specifically is has lower False Discovery Rate and at the same time a much lower Miss Rate. In simple terms it misclassifies less reads, introduces less noise and at the same time recovers more reads.

This plot compares false discovery rates and miss rates between Pheniqs and traditional minimum distance decoding. The X axis shows increasing average error rates. In the error range relevant to most Illumina runs, which is between 1 in 10000 and 1 in 1000, Pheniqs has lower false discovery rates than MDD while still maintaining a much lower Miss Rate.

It is very fast. A benchmark of decoding 94 standard Illumina dual indexed barcodes in more than 11.5 billion reads on a CPU with 14 cores completed in just over 3 hours. deML for comparison, another statistical classifier that is still much faster than python implementations, took almost 80 hours to complete the same task.
It can handle complex barcode designs with a flexible syntax that can address any number of barcodes anywhere on the read. That means no need for pre or post processing.

In this diagram you can see how Pheniqs uses a syntax similar to python array slicing to address sample, cellular and molecular barcodes on a read made out of 2 biological segments and two index segments and produde a SAM output.

It reports a decoding error probability for every barcode so you can take those metrics into account in your downstream analysis.
It is actively developed and can be extended with new decoding algorithms to match new experimental designs. This is an open invitation for everyone to help us make it better by telling us what features you are missing.
website: https://biosails.github.io/pheniqs/
publication: https://rdcu.be/ctRng
installing Pheniqs on the greene NYU cluster: https://gist.github.com/moonwatcher/e1e76978afc51560ff0a74c881705f36

Installing with conda

Pheniqs can be installed in many ways. You can install it with conda, you can use the pheniqs-build-api to build your own portable binary or you can build it from source. You can have a look here for details.

But the good people at the hpc have created a module for us on greene so we will use this for this demo because it takes exactly two seconds to use it 🙂 module load pheniqs/2.1.0

Example files

In this tutorial we will be using a small toy example with 2500 reads. The reads are pair-ended single indexed and were produced by a MiSeq instrument: Two biological segments that are 51 nucleotides long and a single index that is 8 nucleotides long. You are probably not used to seeing Illumina output this way since samples are often demultiplexed by the core facility but the raw output from the sequencer includes each segment in a gzip compressed fastq file with full read quality data.

You can get the three gzip compressed FASTQ files here: BDGGG_s01.fastq.gz, BDGGG_s02.fastq.gz, BDGGG_s03.fastq.gz.

Format conversions and interleaving

Both input and output can be FASTQ or SAM/BAM/CRAM in either split or interleaved lacing.
Input format is automatically detected from file content, not from the extension, so files without an extension or even with a wrong extension are fine (though still not good practice).
Output format defaults to uncompressed SAM on stdout.
Output format is determined by the output file extension: fastq, fastq.gz (gzip compressed FASTQ), sam, bam or cram.
When output is –, or more explicitly /dev/stdout, you must specify the output format with a query parameter: -?format=fastq, -?format=fastq&compression=gz (gzip compressed FASTQ), -?format=sam (not usually necessery since this is the default), -?format=bam, -?format=cram.
You can also control compression parameters in a similar fashion, see the manual for details.

Lets see some basic commands with Pheniqs..

View a gzip compressed FASTQ file as SAM on stdout

pheniqs mux --input BDGGG_s01.fastq.gz|less

@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --input BDGGG_s01.fastq.gz       VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   76      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG

Currently Pheniqs has only one sub command called mux but there will be more soon. The only argument we use here is -i/--input. Most command line arguments have both a short form and a long form. The short form is easier to use when typing interactively but the long form is better when documenting things because it makes the command more easy to read. Short form arguments can be aggregated, for instance -s -i input.fastq is the same as -si input.fastq. This is all just standard POSIX stuff which Pheniqs tries to adhere to as much as possible.

View a gzip compressed FASTQ file as FASTQ on stdout. Here you can see what you can expect to find in those 3 files.

pheniqs mux --input BDGGG_s01.fastq.gz --output "-?format=fastq"|less

@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG

pheniqs mux --input BDGGG_s02.fastq.gz --output "-?format=fastq"|less

@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GGACTCCT
+
B@CCCFC<

pheniqs mux --input BDGGG_s01.fastq.gz --output "-?format=fastq"|less

@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG

Interleave the 3 gzip compressed FASTQ files into one interleaved uncompressed FASTQ file on stdout

pheniqs mux --input BDGGG_s01.fastq.gz --input 
BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output 
"/dev/stdout?format=fastq"|less

@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 2:N:0:
GGACTCCT
+
B@CCCFC<
@M02455:162:000000000-BDGGG:1:1101:10000:10630 3:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG

-i/--input can show up multiple times on the command line. Notice that the order matters…

pheniqs mux --input BDGGG_s01.fastq.gz --input BDGGG_s03.fastq.gz -i BDGGG_s02.fastq.gz --output "/dev/stdout?format=fastq"|less

@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 2:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 3:N:0:
GGACTCCT
+
B@CCCFC<

Interleave the 3 gzip compressed FASTQ files into one interleaved SAM output on stdout

pheniqs mux --input BDGGG_s01.fastq.gz --input BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output "/dev/stdout?format=sam"|less

@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --input BDGGG_s01.fastq.gz --input BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output /dev/stdout?format=sam      VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     FI:i:1  TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630   13      *       0       0       *       *       0       0       GGACTCCT        B@CCCFC<        FI:i:2  TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     FI:i:3  TC:i:3

Ok, now lets interleave the 3 gzip compressed FASTQ files into one interleaved CRAM formatted file

pheniqs mux -i BDGGG_s01.fastq.gz -i BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output BDGGG.cram

When done, Pheniqs will give you a JSON formatted report on stderr. The report will provide statistics on both input and output as well as separate statistics for both overall reads and reads passing quality control (Pass Filter or PF reads). The documentation on the website describes what each variable means. You can also redirect the report to a file with -R/–report. Since no sample classification was requested all reads were directed to the “unclassified” read group.

{
    "incoming": {
        "count": 2500,
        "pf count": 2430,
        "pf fraction": 0.972
    },
    "sample": {
        "classified count": 0,
        "classified fraction": 0.0,
        "classified pf fraction": 0.0,
        "count": 2500,
        "index": 0,
        "pf classified count": 0,
        "pf classified fraction": 0.0,
        "pf count": 2430,
        "pf fraction": 0.972,
        "unclassified": {
            "ID": "undetermined",
            "PU": "undetermined",
            "count": 2500,
            "index": 0,
            "pf count": 2430,
            "pf fraction": 0.972,
            "pf pooled fraction": 1.0,
            "pooled fraction": 1.0
        }
    }
}

To inspect your CRAM file you can, again, use Pheniqs. Essentially this is a format conversion from CRAM to SAM… CRAM is the state-of-the-art compressed version of BAM. It is basically a super compressed SAM file. If you want to archive your data this can save you a LOT of space.

pheniqs mux -i BDGGG.cram|less

@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux -i BDGGG.cram    VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   76      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
M02455:162:000000000-BDGGG:1:1101:10000:10630   76      *       0       0       *       *       0       0       GGACTCCT        B@CCCFC<
M02455:162:000000000-BDGGG:1:1101:10000:10630   76      *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG

Or in an interleaved FASTQ

pheniqs mux -i BDGGG.cram -o "-?format=fastq"|less

@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GGACTCCT
+
B@CCCFC<
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG

Oh wait! This is probably not what you expected?! Pheniqs thinks every one of the records in your file is a new read, when in fact every three records are 3 segments of the same read. To give you absolute control, Pheniqs by default, expects you to explicitly describe the interleaving nature of the input file. You can use the -s/–sense-input flag so ask Pheniqs to try and guess the interleaving nature of the file.

pheniqs mux -si BDGGG.cram|less

@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux -si BDGGG.cram   VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     FI:i:1  TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630   13      *       0       0       *       *       0       0       GGACTCCT        B@CCCFC<        FI:i:2  TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     FI:i:3  TC:i:3

Om this makes more sense… Here it is in interleaved FASTQ

pheniqs mux -si BDGGG.cram -o "-?format=fastq"|less

@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 2:N:0:
GGACTCCT
+
B@CCCFC<
@M02455:162:000000000-BDGGG:1:1101:10000:10630 3:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG

Here are the sizes of the files in the different formats.

size	type
959K	fastq
177K	fastq.gz
1.1M	sam
177K	bam
113K	cram

You can see how valuable compression can be. bam and cram files not only provide better compression they can also be read and written faster. The differences can become more pronounced with bigger files. Since Pheniqs can convert between the formats in memory you can store data on disk in a cram file and use Pheniqs to feed FASTQ to tools that accept FASTQ from standard input.

Read manipulation

Pheniqs can also parse and manipulate the content of the read. It uses a syntax similar to python array slicing to tokenize every read segment and assemble new segments from the tokens. The syntax for manipulating the read structure is too complex to provide on a command line, you will need a JSON config file.

Let’s start with a simple theoretical example. Suppose the first 8 nucleotides on the first segment are actually a cellular barcode and you want to separate them into a different segment. We create a configuration file that describes the layout:

{
  "template": {
    "transform": {
      "token": [ "0:8:", "1:0:", "0:0:8", "2:0:" ]
    }
  }
}

pheniqs mux --config manipulate.json -si BDGGG.cram |less

M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       TAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     FGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     FI:i:1  TC:i:4
M02455:162:000000000-BDGGG:1:1101:10000:10630   13      *       0       0       *       *       0       0       GGACTCCT        B@CCCFC<        FI:i:2  TC:i:4
M02455:162:000000000-BDGGG:1:1101:10000:10630   13      *       0       0       *       *       0       0       CTAAGAAA        CCCCCGGG        FI:i:3  TC:i:4
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     FI:i:4  TC:i:4

We used the cram file as input since its faster to read and its also easier to feed Pheniqs with an interleaved file.

Since Pheniqs follows the python array slicing syntax it uses a zero based coordinate system. That means counters start from zero and are inclusive of start coordinate and exclusive of end coordinate. It might sound complicated but it actually makes arithmatics very easy. for instance the first 3 nucleotides on the first read segment are “0:0:3”.

A token is made of 3 colon separated integers. The first is the mandatory zero based input segment index. The second is an inclusive start coordinate and defaults to 0 if omitted. The third is an exclusive end coordinate. If the end coordinate is omitted the token spans to the end of the segment. start coordinate and end coordinate can take positive or negative values to access the segment from either the 5’ (left) or 3’ (right) end and mimic the python array slicing syntax. The two colons are always mandatory.

The template configuration element controls what the output read will look like. In this case we asked it to have 4 segments. The first output segment is “0:8:”, or cycles 8 to end of input segment 0. Pheniqs can validate your config file and describe, in words, what the conversion from input to output will be. here is the description given by

pheniqs mux --config manipulate.json -si BDGGG.cram --validate

Output transform

    Output segment cardinality                  4

    Token No.0
        Length        variable
        Pattern       0:8:
        Description   cycles 8 to end of input segment 0

    Token No.1
        Length        variable
        Pattern       1::
        Description   cycles 0 to end of input segment 1

    Token No.2
        Length        8
        Pattern       0::8
        Description   cycles 0 to 8 of input segment 0

    Token No.3
        Length        variable
        Pattern       2::
        Description   cycles 0 to end of input segment 2

    Assembly instruction
        Append token 0 of input segment 0 to output segment 0
        Append token 1 of input segment 1 to output segment 1
        Append token 2 of input segment 0 to output segment 2
        Append token 3 of input segment 2 to output segment 3

Classifying sample barcodes

Now assume that the 8 nucleotides on the second out of three segments in the input read is actually the sample barcode. That would be the standard Illumina protocol for a single indexed pair-end run. We want to classify the reads by that barcode. If sequencers never made mistakes and reads were perfect that would be a very easy task since you would just need to compare two strings with 8 characters. In reality sequencers make mistakes, between 1 in 1000 and 1 in 10000 on most Illumina sequencers.

Most tools that classify barcodes ignore the quality score the sequencer provides and simply look for either a perfect string match or at most 1 mismatch, if the barcodes in your set are different enough from each other to tolerate the mismatch. This method is called minimum distance decoding, abbreviated MDD, and Pheniqs can do that if you insist. You just set the decoder property "algorithm": "mdd".

But Pheniqs can do something much smarter, it can compute the posterior probability of the correct barcode being decoded given the list of possible barcodes (see the paper for all the mathematical nitty gritty) and make decisions based on that. It can even report that probability for every read in a standard SAM auxiliary field. Think of it as a quality score for decoding the barcode. Except its not just a quality score, its an actual probability! if you had two, independent, barcodes (for instance a sample and cellular barcode) the probability that the entire classification is correct is simply the product of the two! We call this a Phred Adjusted Maximum Likelihood Decoder, abbreviated PAMLD.

PAMLD has much less false negatives, a LOT less than MDD. You can expect it to recover a lot of the reads that MDD is missing. How many depends on how confident you want Pheniqs to be, which you specify in the confidence threshold parameter. But PAMLD does another cool thing, it filters noise. It can tell if a read is more likely to be noise than an actual read. For instance a PhiX read that you spiked into the mixture with your pooled libraries or some human DNA that found its way into your sample (how did that get here?!). Benchmarks on simulated data tell us that it is VERY good at picking up noise.

Let’s look at a config file for decoding the sample barcode.

{
    "input": [ "BDGGG.cram", "BDGGG.cram", "BDGGG.cram" ],
    "template": {
        "transform": {
            "token": [
                "0::",
                "2::"
            ]
        }
    },
    "sample": {
        "algorithm": "pamld",
        "confidence threshold": 0.99,
        "noise": 0.01,
        "transform": {
            "token": [ "1::8" ]
        },
        "codec": {
            "@AGGCAGAA": {
                "LB": "trinidad 5",
                "barcode": [ "AGGCAGAA" ],
                "concentration": 1
            },
            "@CGTACTAG": {
                "LB": "trinidad 4",
                "barcode": [ "CGTACTAG" ],
                "concentration": 1
            },
            "@GGACTCCT": {
                "LB": "trinidad 9",
                "barcode": [ "GGACTCCT" ],
                "concentration": 1
            },
            "@TAAGGCGA": {
                "LB": "trinidad 1",
                "barcode": [ "TAAGGCGA" ],
                "concentration": 1
            },
            "@TCCTGAGC": {
                "LB": "trinidad 8",
                "barcode": [ "TCCTGAGC" ],
                "concentration": 1
            }
        }
    }
}

Since we don’t want the sample barcode to be part of the output read we skip that segment in the template transform, we only want the first and last segment in the output.

Next we declare a sample instruction in the config file. The transform in the sample instruction tells Pheniqs the barcode is on the first 8 nucleotides of the second input segment. The codec section is where you specify details on the barcodes you expect to find in your data. You naturally specify the barcode sequence, or multiple sequences if for instance you use dual indexing, but really you can have any number of segments in your barcodes, as long as their length and structure matches what you specified in the transform. Pheniqs has good error handling, if your config file has an error it will tell you exactly what went wrong.

For instance if in the first barcode you only specified 7 nucleotides:

    "@AGGCAGAA": {
        "LB": "trinidad 5",
        "barcode": [ "AGGCAGA" ],
        "concentration": 1
    },

Configuration error : sample decoder : expected 8 but found 7 nucleotides in segment 0 of barcode @AGGCAGAA

or if you copy/pasted your barcodes wrong and had the first one specified twice:

    "@AGGCAGAA": {
        "LB": "trinidad 5",
        "barcode": [ "CGTACTAG" ],
        "concentration": 1
    },
    "@CGTACTAG": {
        "LB": "trinidad 4",
        "barcode": [ "CGTACTAG" ],
        "concentration": 1
    },

Configuration error : sample decoder : duplicate barcode sequence CGTACTAG

We specified the input in the config file instead of the command line, so we don’t have to keep typing it. We also specified how to parse the input explicitly, instead of using the –sense-input, by specifying the interleaved input file 3 times, telling Pheniqs to pull three records for every read, or that it should expect to find 3 consecutive segments for every read in the file. being able to explicitly specify how to parse the input allows you to deal with malformed files (lets face it, bioinformatics isn’t perfect).

Notice we specified a 0.99 confidence threshold, so any barcode with a posterior decoding probability lower than that will be marked as failing quality control and moved to the unclassified (or undetermined) group. confidence threshold defaults to 0.95 but it represents a tradeoff between false positives and false negatives and the ideal value depends on your data and your barcode set, and frankly, what you want. We found 0.95 to be a fair tradeoff but if you prefer to loose more reads but be super confident about them you can set higher values and if you feel that your downstream analysis will be able to detect false positives and really want to recover as many as possible reads you can set lower values.

PAMLD’s Bayesian model depends on a set of prior probabilities, basically how much noise (the noise variable) and how much of each sample (the concentration variable in each barcode) you expect to find in your data. Noise defaults to 0.01, which is 1%, which is the amount of PhiX textbook Illumina runs have spiked in. But if you are doing anything other than a vanilla Illumina protocol you probably have more. if you leave out the concentration they all default to 1, essentially forming a uniform distribution. Pheniqs will normalize what ever you specify in concentration so that they all sum up to 1 – noise. For instance if you specify 1 in all your barcodes and 0.01 in noise and you had 5 samples they will each be assumed to have a 0.99 / 5 = 0.198 prior. What we have in the above config file could have just beed left out since its the default.

If you have a good guess you can specify your priors in the config. We will see later that Pheniqs can give you a pretty good estimate for them.

First lets see how the output looks

pheniqs mux --config decode.json |less

@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@RG     ID:AGGCAGAA     BC:AGGCAGAA     LB:trinidad 5   PU:AGGCAGAA
@RG     ID:CGTACTAG     BC:CGTACTAG     LB:trinidad 4   PU:CGTACTAG
@RG     ID:GGACTCCT     BC:GGACTCCT     LB:trinidad 9   PU:GGACTCCT
@RG     ID:TAAGGCGA     BC:TAAGGCGA     LB:trinidad 1   PU:TAAGGCGA
@RG     ID:TCCTGAGC     BC:TCCTGAGC     LB:trinidad 8   PU:TCCTGAGC
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --config decode.json     VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     RG:Z:GGACTCCT   BC:Z:GGACTCCT   QT:Z:B@CCCFC<   XB:f:1.0616e-06
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     RG:Z:GGACTCCT   BC:Z:GGACTCCT   QT:Z:B@CCCFC<   XB:f:1.0616e-06
M02455:162:000000000-BDGGG:1:1101:10005:7907    589     *       0       0       *       *       0       0       CCGTTTGAATGTTGACGGGATGAACATAATAAGCAATGACGGCAGCAATAA     CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGG     RG:Z:undetermined       BC:Z:ATGTTTAT   QT:Z:@----6,,
M02455:162:000000000-BDGGG:1:1101:10005:7907    653     *       0       0       *       *       0       0       GAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCG     CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGG     RG:Z:undetermined       BC:Z:ATGTTTAT   QT:Z:@----6,,

Pheniqs has declared a read group for every sample barcode and one for the undetermined. The BC and QT tags hold the values for the uncorrected sample barcode and the quality scores for it. RG is a reference to the Read Group ID as declared in the header. Those are all standard tags, you will find them in the SAM specifications. XB is the special Pheniqs sauce, it is the probability that the sample barcode decoding was incorrect, the error probability. In the second read (remember, this is an interleaved file, every two records are a read. You can see they have the same read ID) decoding failed. BC and QT are there but RG points to undetermined and there is no XB. If you will also notice the flag field is different since the bit marking this read as failing QC is on.

You can add Read Group metadata to your config file and it will end up where you expect it to. You already saw an example of that with the LB tag. for instance we add some global ones so they are added to all Read Groups.

{
    "CN": "CGSB",
    "DT": "2018-02-25T07:00:00+00:00",
    "PI": "300",
    "PL": "ILLUMINA",
    "PM": "miseq",
    "SM": "trinidad",
    "input": [ "BDGGG.cram", "BDGGG.cram", "BDGGG.cram" ],
    .
    .
    .
}

and we get a richer header

@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined CN:CGSB DT:2018-02-25T07:00:00+00:00    PI:300  PL:ILLUMINA     PM:miseq        PU:undetermined SM:trinidad
@RG     ID:AGGCAGAA     BC:AGGCAGAA     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 5   PI:300  PL:ILLUMINA     PM:miseq        PU:AGGCAGAA     SM:trinidad
@RG     ID:CGTACTAG     BC:CGTACTAG     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 4   PI:300  PL:ILLUMINA     PM:miseq        PU:CGTACTAG     SM:trinidad
@RG     ID:GGACTCCT     BC:GGACTCCT     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 9   PI:300  PL:ILLUMINA     PM:miseq        PU:GGACTCCT     SM:trinidad
@RG     ID:TAAGGCGA     BC:TAAGGCGA     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 1   PI:300  PL:ILLUMINA     PM:miseq        PU:TAAGGCGA     SM:trinidad
@RG     ID:TCCTGAGC     BC:TCCTGAGC     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 8   PI:300  PL:ILLUMINA     PM:miseq        PU:TCCTGAGC     SM:trinidad
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --config decode.json     VN:2.1.0

Estimating the priors

Suppose you have no idea what the priors are. You THINK you add the same amount of each pooled library but who know what actually happened. You also have no idea how much noise is there…

You make an initial Pheniqs run with your best guess, or with the defaults, and specify the --prior command line argument and provide a path to a new config file. Pheniqs will use the run to estimate the relative proportions of barcodes and noise in your data and write a new config file that contains the estimated priors. Since you don’t really want the actual output reads from this run you can save some time by using another nifty Pheniqs feature: you specify the output to be /dev/null basically telling Pheniqs you don’t want the output. That will save quite a bit of time since Pheniqs won’t waste time formatting the output or writing it anywhere.

pheniqs mux --config decode.json --prior decode_with_prior.json --output /dev/null

when its done running you will find a new config file at decode_with_prior.json.

{
    "CN": "CGSB",
    "DT": "2018-02-25T07:00:00+00:00",
    "PI": "300",
    "PL": "ILLUMINA",
    "PM": "miseq",
    "SM": "trinidad",
    "input": [
        "BDGGG.cram",
        "BDGGG.cram",
        "BDGGG.cram"
    ],
    "output": [
        "/dev/null"
    ],
    "prior adjusted job url": "decode_with_prior.json?format=json",
    "sample": {
        "algorithm": "pamld",
        "codec": {
            "@AGGCAGAA": {
                "LB": "trinidad 5",
                "barcode": [
                    "AGGCAGAA"
                ],
                "concentration": 0.186575935178905
            },
            "@CGTACTAG": {
                "LB": "trinidad 4",
                "barcode": [
                    "CGTACTAG"
                ],
                "concentration": 0.198520089969607
            },
            "@GGACTCCT": {
                "LB": "trinidad 9",
                "barcode": [
                    "GGACTCCT"
                ],
                "concentration": 0.211699846980038
            },
            "@TAAGGCGA": {
                "LB": "trinidad 1",
                "barcode": [
                    "TAAGGCGA"
                ],
                "concentration": 0.219937195111557
            },
            "@TCCTGAGC": {
                "LB": "trinidad 8",
                "barcode": [
                    "TCCTGAGC"
                ],
                "concentration": 0.168041901882987
            }
        },
        "confidence threshold": 0.99,
        "noise": 0.015225030876904,
        "transform": {
            "token": [
                "1::8"
            ]
        }
    },
    "template": {
        "transform": {
            "token": [
                "0::",
                "2::"
            ]
        }
    }
}

Notice the concentration and noise parameters have been adjusted to reflect what Pheniqs guessed from the initial run. The actual methodology for this prior estimate is described in the paper and in the manual. Notice the method is recursive since it uses your initial values as the base for the estimation. you can potentially run it a second time and get an even more accurate estimate but you also run the risk of over training your priors, but this is a different discussion… our benchmarks on synthetic data showed that a single estimation that started from a uniform distribution on data that was VERY non uniform still yielded a significant improvement in both precision and recall, and was just shy of using the real (and obviously unknown when your data is real and not simulated) priors.

one small minor detail is that since you executed Pheniqs with a null output your new config also specifies a null output. you can either edit the config or override it on the command line when you use this new config. pheniqs mux --config decode_with_prior.json -o -|less

@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined CN:CGSB DT:2018-02-25T07:00:00+00:00    PI:300  PL:ILLUMINA     PM:miseq        PU:undetermined SM:trinidad
@RG     ID:AGGCAGAA     BC:AGGCAGAA     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 5   PI:300  PL:ILLUMINA     PM:miseq        PU:AGGCAGAA     SM:trinidad
@RG     ID:CGTACTAG     BC:CGTACTAG     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 4   PI:300  PL:ILLUMINA     PM:miseq        PU:CGTACTAG     SM:trinidad
@RG     ID:GGACTCCT     BC:GGACTCCT     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 9   PI:300  PL:ILLUMINA     PM:miseq        PU:GGACTCCT     SM:trinidad
@RG     ID:TAAGGCGA     BC:TAAGGCGA     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 1   PI:300  PL:ILLUMINA     PM:miseq        PU:TAAGGCGA     SM:trinidad
@RG     ID:TCCTGAGC     BC:TCCTGAGC     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 8   PI:300  PL:ILLUMINA     PM:miseq        PU:TCCTGAGC     SM:trinidad
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --config decode_with_prior.json -o -     VN:2.1.0-6-gb006bb7a4990e86e62e4f0085e405662e88daef6
M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     RG:Z:GGACTCCT   BC:Z:GGACTCCT   QT:Z:B@CCCFC<   XB:f:1.10298e-06
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     RG:Z:GGACTCCT   BC:Z:GGACTCCT   QT:Z:B@CCCFC<   XB:f:1.10298e-06

Splitting output

Interleaved files have many advantages: single input and single output allow you to use standard POSIX streams and avoid storing intermediate files. Its faster to read and write. It keeps experimental directory structures simpler and makes archives easier to browse. But sometimes you need to split your reads over multiple files, either by read segment or by barcode. Pheniqs can do that.

In the same fashion that the input directive allowed you to dictate multiple files to pull read segments from the output directive allows you to split the segments over multiple files.

Read segment Splitting

If your output read has multiple segments you can either specify just one output, in which case all segments will be interleaved into that one output. Or you can specify as many outputs as you have output segments and the segments will be written to the seperate files.

either on the command line

pheniqs mux --config decode.json --prior decode_with_prior.json --output first_segment.fastq.gz -o second_segment.bam

or in the config file

{
    "output": [
        "first_segment.fastq.gz",
        "second_segment.bam"
    ]
}

Each output file is evaluated independently so they don’t have to have the same format!

Barcode Splitting

You can also split reads into multiple files by barcodes. Either sample, cellular or molecular barcodes as long as you only split by one decoder. To do that you specify the output directive in the config section for each barcode.

{
    "CN": "CGSB",
    "DT": "2018-02-25T07:00:00+00:00",
    "PI": "300",
    "PL": "ILLUMINA",
    "PM": "miseq",
    "SM": "trinidad",
    "input": [
        "BDGGG.cram",
        "BDGGG.cram",
        "BDGGG.cram"
    ],
    "output": [
        "BDGGG_default.bam"
    ],
    "prior adjusted job url": "decode_with_prior.json?format=json",
    "sample": {
        "algorithm": "pamld",
        "codec": {
            "@AGGCAGAA": {
                "LB": "trinidad 5",
                "barcode": [
                    "AGGCAGAA"
                ],
                "concentration": 0.186575935178905,
                "output": [
                    "BDGGG_AGGCAGAA.bam"
                ]
            },
            "@CGTACTAG": {
                "LB": "trinidad 4",
                "barcode": [
                    "CGTACTAG"
                ],
                "concentration": 0.198520089969607,
                "output": [
                    "BDGGG_CGTACTAG_s01.fastq.gz",
                    "BDGGG_CGTACTAG_s02.fastq"
                ]
            },
            "@GGACTCCT": {
                "LB": "trinidad 9",
                "barcode": [
                    "GGACTCCT"
                ],
                "concentration": 0.211699846980038
            },
            "@TAAGGCGA": {
                "LB": "trinidad 1",
                "barcode": [
                    "TAAGGCGA"
                ],
                "concentration": 0.219937195111557
            },
            "@TCCTGAGC": {
                "LB": "trinidad 8",
                "barcode": [
                    "TCCTGAGC"
                ],
                "concentration": 0.168041901882987
            }
        },
        "confidence threshold": 0.99,
        "noise": 0.015225030876904,
        "transform": {
            "token": [
                "1::8"
            ]
        }
    },
    "template": {
        "transform": {
            "token": [
                "0::",
                "2::"
            ]
        }
    }
}

You can still set a global output and data from any barcode that has not declared its own output directive will be routed to the global output. You can have different layouts for each barcode. When we said Pheniqs is flexible, we meant it 😉

Building R Packages

Hands On

Load Required Packages

# devtools automates some of the setup for a new package
library(devtools)

# roxygen2 handles building the documentation
library(roxygen2)

Define Some Variables

# Define package directory
package_wd <- "/home/mk5636/badas-r-package"

# Define package name as string
package_name <- 'badas'

Create A Package Skeleton Using create()

dir.create(file.path(package_wd))
setwd(file.path(package_wd))

# The 'create' function from devtools creates the
# directory structures and skeleton help files
# for a new package
create(package_name)

Examine at the directory structure and DESCRIPTION file that was created by the create() function

Create a Function

Create an R function with a roxygen2-style header (for documentation). This will be incorporated into the package.

#' custom_add
#'
#' A custom function to add two numbers together
#'
#' @name custom_add
#' @param x The first number.
#' @param y The second number.
#' @return The result of adding the two numbers.
#' @export
#' @examples
#' result <- custom_add(1,2)
custom_add <- function(x,y){
    return(x+y)
}

Let’s test it out

custom_add(1, 2)

## [1] 3

Copy and paste the function into a file called {function_name}.R in {package_wd}/R/

Add data to a function

# Name the data
data_name <- 'mydata'

# Create the data object
data_obj <- c(1,2)

# Assign the data name to the data object
assign(x=data_name, value=data_obj)

# Create the data folder
data_path <- paste(package_wd, "/", package_name, "/data/" ,sep="")
dir.create(file.path(data_path))

# Write .rda file to data folder
data_file <- paste(data_path, data_name, ".rda", sep="")
save(list=data_name, file=data_file)

Create the documentation for the data

#' mydata
#'
#' A vector containing some test data.
#'
#' @format A vector with 2 values: c(1,2)
#' @source \\url{http://www.flybase.org}

Copy and paste the above into a file called {data_name}.R in {package_wd}/R/

Document, Build, and Install the Package

setwd(file.path(package_wd))

# The 'document' function from devtools is a wrapper for the
# roxygen2::roxygenize() function from the roxygen2 package
# It builds all documentation for a package
document(package_name)

# The 'build' function from devtools converts a
# package source directory into a single bundled file. 
# Returns the path to the built package
package_tar <- build(package_name)

# Let's install the package now
install.packages(package_tar, repos=NULL, type='source')

Quit your R session, start a new session, and try the following commands

This shouldn’t work

custom_add(1,2)

Error in custom_add(1, 2) : could not find function "custom_add"

This should work:

library(badas)

custom_add(1,2)

## [1] 3

See how the roxygen2 header we created for the function translates into documentation

?custom_add

Congratulations, you have successfully built an R package! Now you can share the tarball with collaborators to allow them to use your package. Next we will learn about how to submit packages to CRAN and how to install packages from GitHub.

Submitting to CRAN

Submitting to CRAN used to be very hard: creating documentation by hand, manually checking everything, etc. But with tools like devtools, roxygen2, etc. it has become much easier.

1) Might be a good idea to check the package name you want to use is not already taken. You can search the CRAN package list, or you can use the package available.

library(available)

available("available", browse = FALSE)

You can also use suggest to help you come up with a name, ex:

suggest("Convert Yeast IDs")

## [1] convertr

You probably want to do this before you build your package.

2) Run the check_built() function to ensure:

package can be built, installed, and uninstalled without issues
no potential conflicts, plays nice with other packages
required documentation present
correct formatting
no warnings
no notes

check_built(package_tar)

If there are notes, you need to explain them in the comments section when you submit your package.

3) Once you’re ready, submission is done through the web:
https://cran.r-project.org/submit.html

Upload the tarball, which must have been built using the build() function as we did, along with your name and email address, and comments optionally (here is where you would explain any notes from the check_built() function for example).

Please note: You should read the policies and ensure your package is compliant prior to submitting:
https://cran.r-project.org/web/packages/policies.html

Installing Packages from GitHub

library(devtools)
install_github("GreshamLab/labtools")

JBrowse: Visualizing Data Quickly & Easily

Watch the September 24, 2020 BADAS here.

Introduction

During the 2020 Summer of COVID-19, the Ghedin and Gresham labs sequenced SARS-CoV-2 isolates. To visualize and share this data with collaborators the web-based genome visualization software JBrowse was used https://jbrowse.bio.nyu.edu/covid-19.

To benefit all researchers at NYU engaged in genomics research, a centralized JBrowse service has been published at http://jbrowse.bio.nyu.edu/ for PIs and their lab members.

Demo

We’ll go through a demo of

Uploading a dataset
How to manipulate the dataset visualization
Some features
Sharing URLs

This dataset is located in /beegfs/eb167/yeast. You can see the standard files you are familiar with. There is one file specific to JBrowse and that is the samplelist text file. It is used to tell JBrowse how to group the files after upload, saving you the hassle of manually manipulating the configuration file.

To upload, run the following but only change the part in bold and replace with your netID.

cgsb_upload2jbrowse -p demo -d netID -f /beegfs/eb167/yeast

You will be prompted for your password twice in this process, but once you request access to a PI lab passwordless authentication will be configured for you.

Let me explain what this command does and the options available.

USAGE: cgsb_upload2jbrowse -p PI -d DATASET [-f FOLDER] [-s SAMPLELIST] [FILES] 
-------------------------------------------------------
-p | --PI                specify PI 
-d | --dataset           specify data set 
-f | --folder            specify folder containing files 
-s | --samplelist        specify sample list for categorization 
-------------------------------------------------------
File formats supported: 
- fa 
- fasta 
- fna 
- vcf.gz* 
- bam* 
- bam.bw 
- cram*  
- gff3.gz* 
 *Requires index file (tbi, bai, crai) of the same base name

Execution of this script will rsync or upload the specified files or folders to the appropriate JBrowse folder, create basic configurations for your dataset, and publish the site.

The required fields are -p/–PI for the PI you are a member of and -d/–dataset is for the dataset name that will be visible on the public site. Optional arguments are -f/–folder to recursively upload a folder and its contents as we just did and -s/–samplelist if the file is not named ‘samplelist’.

You’ll notice after running the command and if successfully completing you’ll be given a URL link. Something like: https://jbrowse.bio.nyu.edu/demo/?data=data/netID

Other Uploading Options:

Example 2 –

To transfer your data within your scratch (/scratch/user/project1/data) along with the reference data in the prince shared genome repository folders to Smith’s project1 data set run the following.

cgsb_upload2jbrowse -p Smith -d project1 \
 -f /scratch/user/project1/data  \   /scratch/work/cgsb/genomes/Public/Fungi/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa \    /scratch/work/cgsb/genomes/Public/Fungi/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Saccharomyces_cerevisiae.R64-1-1.34.gff3

Example 3 –

To transfer from outside the Prince cluster

# Transfer the files
rsync --progress -ruv /path/to/dataset/ \ 
     <NYUnetID>@jbrowse.bio.nyu.edu:/jbrowse/<PI>/<DATASET>
# Build and publish the tracks based on the files uploaded
ssh <NYUnetID>@jbrowse.bio.nyu.edu addTracks --PI <PI> --dataset <DATASET>

The data will be accessible immediately on the JBrowse server. Choose your PI on the JBrowse homepage’s dropdown menu then the data set name that was specified in the previous step. Once accessed you will be able to display visualizations or tracks for each file. These tracks by default will be named after the file itself.

Let’s go to the URL of our uploaded yeast dataset.

On the left we see two blocks labeled combined_ntr_21 and combined_ntr_22. This again was possible because we had a samplelist file listing the prefix used for grouping. If one is not provided it will look like this https://jbrowse.bio.nyu.edu/demo/?data=data/badas-noList.

The available tracks will be selectable on the left allowing you to display only items of interest and their order displayed.

If you go to the Track menu at the top of the page, you have two options to create a combination track combining 2 tracks or a sequence search track, which shows regions of the referenced sequence or its translations that match a DNA or amino acid sequence.

We can search for features of interest with the search bar at the top.

To the right of the search box we have the highlight button. We can highlight areas of note as well. This is beneficial as the URL will dynamically change based on the current view which includes the highlights.

You can then use this unique URL to share with colleagues or post in publications. Again this site is open to the public without need for VPN.

Customizing Tracks

We have configured the site to ingest your data in the JBrowse default fashion. You can edit the tracks configuration file (tracks.conf) to meet your needs. For more information on what is possible visit https://jbrowse.org/docs/installation.html.

Next Steps

Requesting Permissions

For access, request access through the biology department forms portal at https://forms.bio.nyu.edu. This is separate from HPC access and will require approval from the PI.

The primary reason being you will have ability to alter or delete all datasets associated with the PI. The data is not backed up. The only thing that is backed up nightly are the configuration files.

GATK Pipeline

We have integrated automated JBrowse visualization into the GATK pipeline, which performs alignment and variant calling. Within your nextflow.config file add the following lines to specify the data set name and the PI.

// JBrowse params
params.do_jbrowse = true
params.gff = "/scratch/work/cgsb/genomes/Public/Fungi/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Saccharomyces_cerevisiae.R64-1-1.34.gff3"
params.jbrowse_pi = "Smith"
params.dataset_name = "project1"

Intro to ggplot

Load packages

require(ggplot2)

How ggplot works

The basic structure of a plot

Let’s load in some data that we can use for plotting: the iris dataset, a built-in dataset in R that contains petal and sepal dimensions of various individuals from three different iris species.

print(iris)

Let’s use base R to plot the Sepal Length vs Sepal Width for all the data

plot(iris$Sepal.Length, iris$Sepal.Width)

The same plot in ggplot is a bit more complicated to put together:

ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

The above contains the components that are the bare minimum of what we need for a ggplot plot; we can add more on later, but let’s dissect the parts of this command:

ggplot(data = <DATA>, mapping = aes(<Mapping>)) +
        <GEOM_FUNCTION>()

arguments:
- data: the dataframe you want to plot
- mapping: Any variables from your data that affect plot output, listed in aes( )
commands:
- ggplot( ): required start of every ggplot command. Contains any options that we want to apply to the whole plot (which can be nothing)
- geom_{something}( ): how you’re plotting the data. Here, we want to plot points, so we’re using geom_point; there are tons of different geoms available, one for each type of plot you might want to make.

Arguments like data and mapping can go in the parentheses after the geom, producing the same plot as above:

ggplot() +
  geom_point(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width))

But there are specific situations in which it’s better to do this (we’ll see them later)

Modifying geom properties

We can also pass additional arguments to the geom: useful ones to know are:

color: line color; for the default shape used in geom_point, this actually colors the inside of the shape as well
fill: the fill color inside a shape
size: point size or line thickness
shape: for points, this is the shape; for lines, this is the line pattern or dashyness
alpha: transparency level, with 0 being totally transparent and 1 being a solid, opaque color

For example:

ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = 'blue', fill = 'yellow', shape = 23, alpha = 0.33, size = 5)

Why are some of these rhombuses darker than others? Note that any arguments that universally affect the properties of the points, lines, etc that we’re plotting, like the ones we used above, must be passed to the relevant geom, not to the ggplot( ) command. This is because the geom is in charge of making the points!

Mapping lots of variables

The plot we made above isn’t really all that useful. It’s great to see the data across all three species on one plot, but if we’re looking at this data, we’re probably actually interested in how these species differ from each other. So how do we make ggplot visually separate the points by species? Remember that the mapping argument deals with any properties of the plot that depend on variables in the supplied data frame. So we can modify our original code like this:

ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(alpha = 0.33)

# Can also be written as:
ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species), alpha = 0.33)

Notice that the plot above uses both a variable-dependent color (based on the iris dataframe’s Species column), which goes inside aes( ), and a variable-independent alpha value that applies to the whole geom_point command and goes outside aes( ) Also, notice that you got a legend for free! You didn’t have to tell ggplot how to make it, or what info to include in it; it knows automatically based on how you set up your mapping. Depending on context, you can make color, fill, shape, size or alpha variable-dependent. Some of these (color, fill, shape) obviously make more sense for categorical variables, while others (alpha, size) make more sense for continuous variables, but ggplot will only rarely stop you from making aesthetically and data representationally questionable choices here. Let’s try an exercise: based on the code above, make a plot where Sepal.Length is on the x axis, Sepal.Width is on the y axis, all the points are colored red, the shape of the point depends on the Species, and the point size depends on Petal.Width Questionable usefulness, but hey, it’s possible and pretty easy…

Stacking multiple geoms

One of the places where ggplot really shines is when you want to combine multiple data representations on one plot. For example, I really like topology-style contour plots, which ggplot can make with geom_density2d. Once we know how to make a basic plot, and combining a contour plot with a plot the individual data points is super easy in ggplot:

# note, the first two lines are just our plot from above
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_density2d() +
  geom_point(alpha = 0.33)

Notice that the alpha argument we provided only applies to geom_point, so the contour lines don’t show any transparency. However, any arguments provided to mapping in an aes( ) statement in the ggplot( ) command apply across all geoms. (Also, notice that when we add a geom, ggplot automatically updates our legend!) One really powerful application of this is that we can actually make each geom( ) represent a different aspect of the same data. Let’s say we’d like our datapoints to be colored by species, but we’d also like to see a contour plot of sepal length vs width across all the species. To do this, we’re going to have to move our mapping calls inside the geoms, since we now want each geom to map the data differently:

# got rid of alpha here just to simplify things
ggplot(data = iris) +
  geom_density2d(aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, color = Species))

This plot shows that mapping actually controls not just where to plot the data points and how they should look aesthetically, but also how the data is grouped when it’s represented in the plot. Notice that in the first contour plot, the statistics needed to plot the contours were computed separately for each species. However, when we removed species from the aes( ) being used by geom_density2d, the data was no longer separated by species for any of the stats calculated for this geom, and they’re instead calculated across all the points in the dataset.

Aside: ggplot objects

ggplot actually creates objects that we can store as variables and add onto. So, for example, we can do this:

basic_iris_plot <-
  ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()
print(basic_iris_plot)

# let's add another geom to this plot
iris_plot_with_contours <-
  basic_iris_plot + geom_density2d()
print(iris_plot_with_contours)

themes and other options we can change

ggplot also allows a huge amount of control over other aspects of the plot (e.g. titles, axis labeling and scale, overall plot look, etc). For most of these, ggplot actually allows multiple equivalent ways to achieve the same effect.

axes + titles

Adding a title to a plot can be achieved using ggtitle()

basic_iris_plot +
  ggtitle('Iris Sepals')

We can also modify the axis properties directly

basic_iris_plot +
  ggtitle('Iris Sepals') +
  scale_x_continuous(name = 'Sepal Length',
                     limits = c(0,10)) +
  scale_y_log10(name = 'Sepal Width',
                breaks = c(2,3,4))

There’s a few things going on here:

scale_x_continuous( ) and scale_y_log10( ): set the scale of the x and y axes. For continuous variables, they can also be plotted on a square root scale, reversed, and various other transformations. For discrete variables, use scale_x_discrete( ) and scale_y_discrete( )
name: the axis label
limits: the bounds on the axis, must be provided as a 2-number vector
breaks: manually assign where the tickmarks go
labels: for discrete variables, this can be used to rename the categories along your axis

legend

You can modify the legend in a similar way to the other mappings (e.g. the axes); for example, if we want to modify the way the thing mapped to ‘color’ on our plot is represented, we can use scale_color_discrete( ), or, if we want to manually change the values assigned to each category (e.g. the colors), scale_color_manual( ):

basic_iris_plot +
  scale_color_manual(values=c("violet", "blue", "gray"),
                     name="Iris Species",
                     labels=c("Bristle-Pointed Iris", "Blue Flag", "Virginia Iris"))

We can also change the position of the legend using theme( ) (which can actually control nearly every other aesthetic aspect of the plot, such as font size, which axes get labels/tickmarks, etc).

basic_iris_plot +
  scale_color_manual(values=c("violet", "blue", "gray"),
                     name="Iris Species",
                     labels=c("Bristle-Pointed Iris", "Blue Flag", "Virginia Iris")) +
  theme(legend.position = 'bottom')

themes

Finally, the overall appearance of the graph can be changed by selecting a custom ‘theme’; this is a bit confusing, since these are distinct from the theme( ) command used above.

basic_iris_plot +
  scale_color_manual(values=c("violet", "blue", "gray"),
                     name="Iris Species",
                     labels=c("Bristle-Pointed Iris", "Blue Flag", "Virginia Iris")) +
  theme(legend.position = 'bottom') +
  theme_bw()

The structure of data that ggplot can plot

As you’ve seen, ggplot provides users with the power to easily change the appearance of the plot, and the statistics calculated, based on any single column in the dataframe containing the data to be plotted. But this also results in some pretty rigid rules about how your data needs to be organized. Namely, data for ggplot should be in tidy format:

each variable must have its own column
each observation must have its own row (but what’s an observation?)
each value must have its own cell

Let’s take a look at what that means. Compare the iris dataframe we’ve been using to the iris3 data, which comes with R and contains the same data:

iris_3_df <- data.frame(iris3)
print(iris_3_df)

Notice that now, there is a single row containing data on plants from each of the three iris species. This is not a completely crazy thing to do: maybe our experiment consisted of 50 individual plots, each of which had a plant from every species, and we collected the data on a plot-by-plot level. But organizing the data in this way makes directly graphing it with ggplot a real pain, at least if we want to compare species with each other. We no longer have a ‘species’ column, or neat columns for other mappings we might be interested in (e.g. Sepal Width). Here’s an attempt to make do with what we have:

ggplot(data = iris_3_df) +
  geom_point(mapping = aes(x = Sepal.L..Setosa, y = Sepal.W..Setosa), color = 'red') +
  geom_point(mapping = aes(x = Sepal.L..Versicolor, y = Sepal.W..Versicolor), color = 'blue') +
  geom_point(mapping = aes(x = Sepal.L..Virginica, y = Sepal.W..Virginica), color = 'green') +
  scale_x_continuous(name = 'Sepal Length') +
  scale_y_continuous(name = 'Sepal Width')

Some things still work well automatically (e.g. ggplot scales axes for us), but it’s a lot more effort to do this, and if we wanted to have a legend on this plot (or had more than a few categories we were interested in), it would be a complete nightmare. When putting together data to plot, we need to think very carefully about what exactly constitutes a single ‘observation’, and what the ‘variables’ are that we want to use for mapping. The tidyr package (which, like ggplot2, is part of the tidyverse package) has some really great functions for re-organizing data that looks like iris3 into a ‘tidy’ dataframe, and if you find yourself facing data that isn’t organized the right way for your plot, I really suggest looking over David Gresham’s tidyverse tutorial.

Why ggplot

easy exploratory data analysis: separation of variable mapping from visual representation (e.g. geoms) makes it easy to try different ways of plotting data
automation of the boring stuff: generation of legends, axis bounds, etc is done very well automatically based on the data and variables you’re plotting (but you can have finer control of it if you’d like)
automated stats: lots of geoms that calculate and display statistics of your data. These statistics are automatically calculated based on your grouping of the data, how you specify your axes (e.g. linear vs log scale), etc.
- geom_density and geom_density2d for density estimation
- geom_smooth for trendlines with error ribbons
- stat_summary for other ways to bin and summarize data)
plots as objects: storing plots as objects allows them to be easily modified and combined into figures
incredibly helpful online examples: the tidyverse website contains an incredible online manual with explanations and clear examples for nearly everything you might want to do in ggplot: https://ggplot2.tidyverse.org/reference/

Using ggplot for paper figures

Because ggplot does a great job of separating aesthetic properties of the plot from what is being plotted, we can create a theme that defines how our plots look in e.g. paper figures and apply it to all our plots after generating them. First, let’s set a theme for our plots. Because we want our figures to look nice and consistent for the paper, there’s a lot of options we can specify here.

final_figure_ggplot_theme <- 
  theme(plot.title=element_text(size=16,face='bold'),
        plot.margin=unit(c(12,12,12,12),'pt'),
        panel.background=element_rect(fill='white'),
        panel.grid.major=element_line(color='grey',size=0.3),
        axis.line = element_line(color="black", size = 0.5),
        legend.title=element_blank(),
        legend.justification=c(0,1),
        legend.key = element_rect(fill='white'),
        legend.key.height = unit(2,'line'),
        legend.text=element_text(size=12,face='bold'),
        axis.text.x=element_text(size=12,face='bold'),
        axis.text.y=element_text(size=12,face='bold'),
        axis.title.x=element_text(size=14,face='bold'),
        axis.title.y=element_text(size=14,face='bold',angle=90)) +
  theme_bw()

Next, let’s create some plots. We don’t have to worry about appearances here; let’s just make sure the data shows up the way we want it to.

# a figure with the iris data
sepal_points <-
  ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()

sepal_trends <-
  ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_smooth(method = 'lm')

petal_boxplot <-
  ggplot(data = iris, aes(x = Species, y = Petal.Width, color = Species)) +
  geom_boxplot() +
  geom_point(position = 'jitter')

We can print these figures to look at them:

print(sepal_points)

print(sepal_trends)

print(petal_boxplot)

Useful, but not that neat. Now, let’s apply our figure theme:

sepal_points_figure <- sepal_points + final_figure_ggplot_theme
print(sepal_points_figure)

sepal_trends_figure <- sepal_trends + final_figure_ggplot_theme
print(sepal_trends_figure)

petal_boxplot_figure <- petal_boxplot + final_figure_ggplot_theme
print(petal_boxplot_figure)

There are a few ways to save the figure, but this is probably the easiest:

ggsave(file = '~/Documents/sepal_points_figure.pdf',
       plot = sepal_points_figure, width = 6.5, height = 4,
       useDingbats=FALSE)

(If saving as a pdf, useDingbats=FALSE is a must and will prevent a ggplot disaster from unfolding. If you load the cowplot package described below, it replaces ggsave with its own function that does this by default) Because we specified our font sizes in final_figure_ggplot_theme, they will be consistent across plots, regardless of the size we decide to save them at.

An example from the literature

eLife reproducible article

Other cool things

combining multiple data frames on one plot

ggplot makes it super easy to combine multiple datasets on one plot, assuming they have the relevant variables (dataframe columns) in common. Let’s break up the iris dataframe to see how this works:

iris_nonvirginca <- subset(iris, Species != 'virginica')
iris_virginica_petals <- subset(iris, Species == 'virginica')[, c('Petal.Width', 'Species')]
print(iris_nonvirginca)

print(iris_virginica_petals)

We now have two dataframes, containing data on different species, and with only a subset of the data in one that is contained in the other (the petal widths and species). But if petal width and species is what we want to plot, this isn’t a problem for ggplot:

ggplot() +
  geom_boxplot(data = iris_nonvirginca, aes(x = Species, y = Petal.Width, color = Species)) +
  geom_boxplot(data = iris_virginica_petals, aes(x = Species, y = Petal.Width, color = Species))

facets

Another great tool ggplot provides is faceting. This allows you to separate data into subplots based on a column (or multiple columns):

basic_iris_plot +
  facet_wrap( ~ Species)

Notice that the x-axes are consistent among these plots.

Lots of additional packages!

Because ggplot is so popular, there’s been a ton of additional packages written that build on top of it. Here are two examples.

gganimate

Add animations to plots

library(gganimate)
basic_iris_plot +
  transition_states(Species)

cowplot

Arrange your plots for publication figures

library(cowplot)

package ‘cowplot’ was built under R version 3.4.4
Attaching package: ‘cowplot’

The following object is masked from ‘package:ggplot2’:

    ggsave

plot_grid(sepal_points_figure, sepal_trends_figure, petal_boxplot_figure,
          labels = "AUTO")

library(cowplot)
top_row <- plot_grid(sepal_points_figure, labels = 'A')
bottom_row <- plot_grid(sepal_trends_figure, petal_boxplot_figure, labels = c('B', 'C'))
plot_grid(top_row, bottom_row, nrow = 2)

Presented by: Tobias Schraink and Stephen Kelly

March 28, 2019

Prerequiste

Prerequisite

Java 8 or later
Nextflow
Conda
Gitclone the repo into a directory in your linux environment

Nextflow hands-on

During this tutorial you will implement a proof of concept of a RNA-Seq pipeline which:

Indexes a trascriptome file.
Performs quality controls
Performs quantification.
Create a MultiqQC report.

Step 1 – define the pipeline parameters

The script script1.nf defines the pipeline input parameters. Run it by using the following command:

nextflow run script1.nf

Try to specify a different input parameter, for example:

nextflow run script1.nf --reads this/and/that

Exercise 1.1

Modify the script1.nf adding a fourth parameter named outdir and set it to a default path that will be used as the pipeline output directory.

Exercise 1.2

Modify the script1.nf to print all the pipeline parameters by using a single println command and a multiline stringstatement.

Tip: see an example here.

Recap

In this step you have learned:

How to define parameters in your pipeline script
How to pass parameters by using the command line
The use of $var and ${var} variable placeholders
How to use multiline strings

Step 2 – Create transcriptome index file

Nextflow allows the execution of any command or user script by using a process definition.

A process is defined by providing three main declarations: the process inputs, the process outputs and finally the command script.

The second example adds the index process. Open it to see how the process is defined.

It takes the transcriptome file as input and creates the transcriptome index by using the salmon tool.

Note how the input declaration defines a transcriptome variable in the process context that it is used in the command script to reference that file in the Salmon command line.

Try to run it by using the command:

nextflow run script2.nf

Exercise 2.1

Print the output of the index_ch channel by using the println operator (do not confuse it with the println statement seen previously).

Exercise 2.3

Use the command tree -a work to see how Nextflow organises the process work directory.

Recap

In this step you have learned:

How to define a process executing a custom command
How process inputs are declared
How process outputs are declared
How to access the number of available CPUs
How to print the content of a channel

Step 3 – Collect read files by pairs

This step shows how to match read files into pairs, so they can be mapped by Salmon.

Edit the script script3.nf and add the following statement as the last line:

read_pairs_ch.println()

Save it and execute it with the following command:

nextflow run script3.nf

It will print an output similar to the one shown below:

[ggal_gut, [/../data/ggal/gut_1.fq, /../data/ggal/gut_2.fq]]

The above example shows how the read_pairs_ch channel emits tuples composed by two elements, where the first is the read pair prefix and the second is a list representing the actual files.

Try it again specifying different read files by using a glob pattern:

nextflow run script3.nf --reads 'data/ggal/*_{1,2}.fq'

Exercise 3.1

Use the set operator in place of = assignment to define the read_pairs_ch channel.

Exercise 3.2

Use the ifEmpty operator to check if the read_pairs_ch contains at least an item.

Recap

In this step you have learned:

How to use fromFilePairs to handle read pair files
How to use the set operator to define a new channel variable
How to use the ifEmpty operator to check if a channel is empty

Step 4 – Perform expression quantification

The script script4.nf adds the quantification process.

In this script note as the index_ch channel, declared as output in the index process, is now used as a channel in the input section.

Also note as the second input is declared as a set composed by two elements: the pair_id and the reads in order to match the structure of the items emitted by the read_pairs_ch channel.

Execute it by using the following command:

nextflow run script4.nf

You will see the execution of a quantication process.

Execute it again adding the -resume option as shown below:

nextflow run script4.nf -resume

The -resume option skips the execution of any step that has been processed in a previous execution.

Try to execute it with more read files as shown below:

nextflow run script4.nf -resume --reads 'data/ggal/*_{1,2}.fq'

You will notice that the quantification process is executed more than one time.

Nextflow parallelizes the execution of your pipeline simply by providing multiple input data to your script.

Exercise 4.1

Add a tag directive to the quantification process to provide a more readable execution log .

Exercise 4.2

Add a publishDir directive to the quantification process to store the process results into a directory of your choice.

Recap

In this step you have learned:

How to connect two processes by using the channel declarations
How to resume the script execution skipping already already computed steps
How to use the tag directive to provide a more readable execution output
How to use the publishDir to store a process results in a path of your choice

Step 5 – Quality control

This step implements a quality control of your input reads. The inputs are the same read pairs which are provided to the quantification steps

You can run it by using the following command:

nextflow run script5.nf -resume

The script will report the following error message:

Channel `read_pairs_ch` has been used twice as an input by process `fastqc` and process `quantification`

Exercise 5.1

Modify the creation of the read_pairs_ch channel by using a into operator in place of a set.

Tip: see an example here.

Recap

In this step you have learned:

How to use the into operator to create multiple copies of the same channel

Step 6 – MultiQC report

This step collect the outputs from the quantification and fastqc steps to create a final report by using the MultiQC tool.

Execute the script with the following command:

nextflow run script6.nf -resume --reads 'data/ggal/*_{1,2}.fq'

It creates the final report in the results folder in the current work directory.

In this script note the use of the mix and collect operators chained together to get all the outputs of the quantificationand fastqc process as a single input.

Recap

In this step you have learned:

How to collect many outputs to a single input with the collect operator
How to mix two channels in a single channel
How to chain two or more operators togethers

Step 7 – Handle completion event

This step shows how to execute an action when the pipeline completes the execution.

Note that Nextflow processes define the execution of asynchronous tasks i.e. they are not executed one after another as they are written in the pipeline script as it would happen in a common imperative programming language.

The script uses the workflow.onComplete event handler to print a confirmation message when the script completes.

Try to run it by using the following command:

nextflow run script7.nf -resume --reads 'data/ggal/*_{1,2}.fq'

Step 8 – Custom scripts

Real world pipelines use a lot of custom user scripts (BASH, R, Python, etc). Nextflow allows you to use and manage all these scripts in consistent manner. Simply put them in a directory named bin in the pipeline project root. They will be automatically added to the pipeline execution PATH.

For example, create a file named fastqc.sh with the following content:

#!/bin/bash
set -e
set -u

sample_id=${1}
reads=${2}

mkdir fastqc_${sample_id}_logs
fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}

Save it, grant the execute permission and move it in the bin directory as shown below:

chmod +x fastqc.sh
mkdir -p bin
mv fastqc.sh bin

Then, open the script7.nf file and replace the fastqc process’ script with the following code:

  script:
    """
    fastqc.sh "$sample_id" "$reads"
    """

Run it as before:

nextflow run script7.nf -resume --reads 'data/ggal/*_{1,2}.fq'

Recap

In this step you have learned:

How to write or use existing custom script in your Nextflow pipeline.
How to avoid the use of absolute paths having your scripts in the bin/ project folder.

Step 9 – Executors

Real world genomic application can spawn the execution of thousands of jobs. In this scenario a batch scheduler is commonly used to deploy a pipeline in a computing cluster, allowing the execution of many jobs in parallel across many computing nodes.

Nextflow has built-in support for most common used batch schedulers such as Univa Grid Engine and SLURM between the others.

To run your pipeline with a batch scheduler modify the nextflow.config file specifying the target executor and the required computing resources if needed. For example:

process.executor = 'slurm'
process.queue = 'short'
process.memory = '10 GB'
process.time = '30 min'
process.cpus = 8

The above configuration specify the use of the SLURM batch scheduler to run the jobs spawned by your pipeline script. Then it specifies to use the short queue (partition), 10 gigabyte of memory and 8 cpus per job, and each job can run for no more than 30 minutes.

Note: the pipeline must be executed in a shared file system accessible to all the computing nodes.

Exercise 9.1

Print the head of the .command.run script generated by Nextflow in the task work directory and verify it contains the SLURM #SBATCH directives for the requested resources.

Exercise 9.2

Modify the configuration file to specify different resource request for the quantification process.

Tip: see the process documentation for an example.

Recap

In this step you have learned:

How to deploy a pipeline in a computing cluster.
How to specify different computing resources for different pipeline processes.

Step 10 – Use configuration profiles

The Nextflow configuration file can be organised in different profiles to allow the specification of separate settings depending on the target execution environment.

For the sake of this tutorial modify the nextflow.config as shown below:

profiles {
  standard {
    process.container = 'nextflow/rnaseq-nf'
    docker.enabled = true
  }

  cluster {
    process.executor = 'slurm'
    process.queue = 'short'
    process.memory = '10 GB'
    process.time = '30 min'
    process.cpus = 8
  }
}

The above configuration defines two profiles: standard and cluster. The name of the profile to use can be specified when running the pipeline script by using the -profile option. For example:

nextflow run script7.nf -profile cluster

The profile standard is used by default if no other profile is specified by the user.

Recap

In this step you have learned:

How to organise your pipeline configuration in separate profiles

Step 11 – Run a pipeline from a GitHub repository

Nextflow allows the execution of a pipeline project directly from a GitHub repository (or similar services eg. BitBucket and GitLab).

This simplifies the sharing and the deployment of complex projects and tracking changes in a consistent manner.

For the sake of this tutorial consider the example project published in the following URL:

https://github.com/nextflow-io/hello

You can run it by specifying the project name as shown below:

nextflow run nextflow-io/hello

It automatically downloads it and store in the $HOME/.nextflow folder.

Use the command info to show the project information, e.g.:

nextflow info nextflow-io/hello

Nextflow allows the execution of a specific revision of your project by using the -r command line option. For Example:

nextflow run -r v1.2 nextflow-io/hello

Revision are defined by using Git tags or branches defined in the project repository.

This allows a precise control of the changes in your project files and dependencies over time.

Step 12 (bonus) – Deposit your pipeline scripts in a GitHub repository

Create a new repository in your GitHub account and upload there the pipeline scripts of this tutorial. Then execute it specifying its name on the Nextflow command line.

Tip: see the documentation for further details.

Docker hands-on

Get practice with basic Docker commands to pull, run and build your own containers.

A container is a ready-to-run Linux environment which can be executed in an isolated manner from the hosting system. It has own copy of the file system, processes space, memory management, etc.

Containers are a Linux feature known as Control Groups or Ccgroups introduced with kernel 2.6.

Docker adds to this concept an handy management tool to build, run and share container images.

These images can be uploaded and published in a centralised repository know as Docker Hub, or hosted by other parties like for example Quay.

Step 1 – Run a container

Run a container is easy as using the following command:

docker run <container-name>

For example:

docker run hello-world

Step 2 – Pull a container

The pull command allows you to download a Docker image without running it. For example:

docker pull debian:wheezy

The above command download a Debian Linux image.

Step 3 – Run a container in interactive mode

Launching a BASH shell in the container allows you to operate in an interactive mode in the containerised operating system. For example:

docker run -it debian:wheezy bash

Once launched the container you wil noticed that’s running as root (!). Use the usual commands to navigate in the file system.

To exit from the container, stop the BASH session with the exit command.

Step 4 – Your first Dockerfile

Docker images are created by using a so called Dockerfile i.e. a simple text file containing a list of commands to be executed to assemble and configure the image with the software packages required.

In this step you will create a Docker image containing the Samtools tool.

Warning: the Docker build process automatically copies all files that are located in the current directory to the Docker daemon in order to create the image. This can take a lot of time when big/many files exist. For this reason it’s important to always work in a directory containing only the files you really need to include in your Docker image. Alternatively you can use the .dockerignore file to select the path to exclude from the build.

Then use your favourite editor eg. vim to create a file named Dockerfile and copy the following content:

FROM debian:wheezy

MAINTAINER <your name>

RUN apt-get update && apt-get install -y curl cowsay

ENV PATH=$PATH:/usr/games/

When done save the file.

Step 5 – Build the image

Build the Docker image by using the following command:

docker build -t my-image .

Note: don’t miss the dot in the above command. When it completes, verify that the image has been created listing all available images:

docker images

You can try your new container by running this command:

docker run my-image cowsay Hello Docker!

Step 6 – Add a software package to the image

Add the Salmon package to the Docker image by adding to the Dockerfile the following snippet:

RUN curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v0.8.2/Salmon-0.8.2_linux_x86_64.tar.gz | tar xz \
 && mv /Salmon-*/bin/* /usr/bin/ \
 && mv /Salmon-*/lib/* /usr/lib/

Save the file and build again the image with the same command as before:

docker build -t my-image .

You will notice that it creates a new Docker image with the same name but with a different image ID.

Step 7 – Run Salmon in the container

Check that everything is fine running Salmon in the container as shown below:

docker run my-image salmon --version

You can even launch a container in an interactive mode by using the following command:

docker run -it my-image bash

Use the exit command to terminate the interactive session.

Step 8 – File system mounts

Create an genome index file by running Salmon in the container.

Try to run Bowtie in the container with the following command:

docker run my-image \
  salmon index -t $PWD/data/ggal/transcriptome.fa -i index

The above command fails because Salmon cannot access the input file.

This happens because the container runs in a complete separate file system and it cannot access the hosting file system by default.

You will need to use the --volume command line option to mount the input file(s) eg.

docker run --volume $PWD/data/ggal/transcriptome.fa:/transcriptome.fa my-image \
  salmon index -t /transcriptome.fa -i index

An easier way is to mount a parent directory to an identical one in the container, this allows you to use the same path when running it in the container eg.

docker run --volume $HOME:$HOME --workdir $PWD my-image \
  salmon index -t $PWD/data/ggal/transcriptome.fa -i index

Step 9 – Upload the container in the Docker Hub (bonus)

Publish your container in the Docker Hub to share it with other people.

Create an account in the https://hub.docker.com web site. Then from your shell terminal run the following command, entering the user name and password you specified registering in the Hub:

docker login

Tag the image with your Docker user name account:

docker tag my-image <user-name>/my-image

Finally push it to the Docker Hub:

docker push <user-name>/my-image

After that anyone will be able to download it by using the command:

docker pull <user-name>/my-image

Note how after a pull and push operation, Docker prints the container digest number e.g.

Digest: sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266
Status: Downloaded newer image for nextflow/rnaseq-nf:latest

This is a unique and immutable identifier that can be used to reference container image in a univocally manner. For example:

docker pull nextflow/rnaseq-nf@sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266

Singularity

Singularity is container runtime designed to work in HPC data center, where the usage of Docker is generally not allowed due to security constraints.

Singularity implements the container execution model similarly to Docker however using a complete different implementation design.

A Singularity container image is archived in a plain file that can be stored in shared file system and accessed by many computing nodes managed by a batch scheduler.

Create a Singularity images

Singularity images are created using a Singularityfile in similar manner to Docker, though using a different syntax.

Bootstrap: docker
From: debian:wheezy

%environment
export PATH=$PATH:/usr/games/


%labels
AUTHOR <your name>

%post

apt-get update && apt-get install -y locales-all curl cowsay
curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v0.8.2/Salmon-0.8.2_linux_x86_64.tar.gz | tar xz \
 && mv /Salmon-*/bin/* /usr/bin/ \
 && mv /Salmon-*/lib/* /usr/lib/

Once you have save the Singularity file. Create the image with these commands:

singularity create my-image.img
sudo singularity bootstrap my-image.img Singularityfile

Singularity requires two commands to build an image, the creates the file image and allocate the required space on the storage. The second build the real container image.

Note: the bootstrap command requires sudo permissions.

Running a container

Once done, you can run your container with the following command

singularity exec my-image.img cowsay Hello Singularity

By using the shell command you can enter in the container in interactive mode. For example:

singularity shell my-image.img

Import an Docker image

An easier way to create Singularity container without requiring sudo permission and bootsting the containers interoperability is to import a Docker image container pulling it directly from a Docker registry. For example:

singularity pull docker://debian:wheezy

The above command automatically download the Debian Docker image and converts it to a Singularity image store in the current directory with the name debian-wheezy.img.

Run a Nextflow script using a Singularity container

Nextflow allows the transparent usage of Singularity containers as easy as with Docker ones.

It only requires to enable the use of Singularity engine in place of Docker in the Nextflow configuration file.

To run your previous script with Singularity add the following profile in the nextflow.config file in the $HOME/hack17-coursedirectory:

profiles {

  foo {
    process.container = 'docker://nextflow/rnaseq-nf'
    singularity.enabled = true
    singularity.cacheDir = "$PWD"
  }
}

The above configuration instructs nextflow to use Singularity engine to run your script processes. The container is pulled from the Docker registry and cached in the current directory to be used for further runs.

Try to run the script as shown below:

nextflow run script7.nf -profile foo

Note: Nextflow will pull the container image automatically, it will require a few seconds depending the network connection speed.

What is Pheniqs and why do you need it?

Installing with conda

Example files

Format conversions and interleaving

Read manipulation

Classifying sample barcodes

Estimating the priors

Splitting output

Read segment Splitting

Barcode Splitting

Hands On

Create a Function

Add data to a function

Document, Build, and Install the Package

Quit your R session, start a new session, and try the following commands

Submitting to CRAN

Installing Packages from GitHub

Introduction

Demo

Other Uploading Options:

Example 2 –

Example 3 –

Customizing Tracks

Next Steps

Requesting Permissions

GATK Pipeline

Eugene Plavskin

12/11/2019

Load packages

How ggplot works

The basic structure of a plot

Modifying geom properties

Mapping lots of variables

Stacking multiple geoms

Aside: ggplot objects

themes and other options we can change

axes + titles

legend

themes

The structure of data that ggplot can plot

Why ggplot

Using ggplot for paper figures

An example from the literature

Other cool things

combining multiple data frames on one plot

facets

Lots of additional packages!

gganimate

cowplot

What is Git?

What is GitHub?

Why Should I Use These?

Which Should I Use?

Follow Along!

Presented by: Tim Stuart (@timoast) and Andrew Butler (@andrewwbutler)

Dataset preprocessing

Integration of 3 pancreatic islet cell datasets

Cell type classification using an integrated reference

Presented by: Tobias Schraink and Stephen Kelly

Nextflow

Docker

Singularity

Prerequisite

Nextflow hands-on

Step 1 – define the pipeline parameters

Exercise 1.1

Exercise 1.2

Recap

Step 2 – Create transcriptome index file

Exercise 2.1

Exercise 2.3

Recap

Step 3 – Collect read files by pairs

Exercise 3.1

Exercise 3.2

Recap

Step 4 – Perform expression quantification

Exercise 4.1

Exercise 4.2

Recap