What is Pheniqs and why do you need it?
Pheniqs is a fast and accurate high-throughput sequencing barcode classifier.- It can convert sequence reads to and from FASTQ, SAM, BAM and CRAM formats, decode barcodes and manipulate read layout in a single process.
- It is very accurate, much more than traditional minimum distance based methods that simply compare the number of mismatches between two sequences. More specifically is has lower False Discovery Rate and at the same time a much lower Miss Rate. In simple terms it misclassifies less reads, introduces less noise and at the same time recovers more reads.
- It is very fast. A benchmark of decoding 94 standard Illumina dual indexed barcodes in more than 11.5 billion reads on a CPU with 14 cores completed in just over 3 hours. deML for comparison, another statistical classifier that is still much faster than python implementations, took almost 80 hours to complete the same task.
- It can handle complex barcode designs with a flexible syntax that can address any number of barcodes anywhere on the read. That means no need for pre or post processing.
- It reports a decoding error probability for every barcode so you can take those metrics into account in your downstream analysis.
- It is actively developed and can be extended with new decoding algorithms to match new experimental designs. This is an open invitation for everyone to help us make it better by telling us what features you are missing.
- website: https://biosails.github.io/pheniqs/
- publication: https://rdcu.be/ctRng
- installing Pheniqs on the greene NYU cluster: https://gist.github.com/moonwatcher/e1e76978afc51560ff0a74c881705f36
Installing with conda
Pheniqs can be installed in many ways. You can install it with conda, you can use the pheniqs-build-api to build your own portable binary or you can build it from source. You can have a look here for details.But the good people at the hpc have created a module for us on greene so we will use this for this demo because it takes exactly two seconds to use it 🙂
module load pheniqs/2.1.0
Example files
In this tutorial we will be using a small toy example with 2500 reads. The reads are pair-ended single indexed and were produced by a MiSeq instrument: Two biological segments that are 51 nucleotides long and a single index that is 8 nucleotides long. You are probably not used to seeing Illumina output this way since samples are often demultiplexed by the core facility but the raw output from the sequencer includes each segment in a gzip compressed fastq file with full read quality data.You can get the three gzip compressed FASTQ files here: BDGGG_s01.fastq.gz, BDGGG_s02.fastq.gz, BDGGG_s03.fastq.gz.
Format conversions and interleaving
- Both input and output can be FASTQ or SAM/BAM/CRAM in either split or interleaved lacing.
- Input format is automatically detected from file content, not from the extension, so files without an extension or even with a wrong extension are fine (though still not good practice).
- Output format defaults to uncompressed SAM on stdout.
- Output format is determined by the output file extension: fastq, fastq.gz (gzip compressed FASTQ), sam, bam or cram.
- When output is –, or more explicitly /dev/stdout, you must specify the output format with a query parameter: -?format=fastq, -?format=fastq&compression=gz (gzip compressed FASTQ), -?format=sam (not usually necessery since this is the default), -?format=bam, -?format=cram.
- You can also control compression parameters in a similar fashion, see the manual for details.
- View a gzip compressed FASTQ file as SAM on stdout
pheniqs mux --input BDGGG_s01.fastq.gz|less
@HD VN:1.0 SO:unknown GO:query
@RG ID:undetermined PU:undetermined
@PG ID:pheniqs PN:pheniqs CL:pheniqs mux --input BDGGG_s01.fastq.gz VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630 76 * 0 0 * * 0 0 CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
Currently Pheniqs has only one sub command called mux
but there will be more soon. The only argument we use here is -i/--input
. Most command line arguments have both a short form and a long form. The short form is easier to use when typing interactively but the long form is better when documenting things because it makes the command more easy to read. Short form arguments can be aggregated, for instance -s -i input.fastq
is the same as -si input.fastq
. This is all just standard POSIX stuff which Pheniqs tries to adhere to as much as possible.- View a gzip compressed FASTQ file as FASTQ on stdout. Here you can see what you can expect to find in those 3 files.
pheniqs mux --input BDGGG_s01.fastq.gz --output "-?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
pheniqs mux --input BDGGG_s02.fastq.gz --output "-?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GGACTCCT
+
B@CCCFC<
pheniqs mux --input BDGGG_s01.fastq.gz --output "-?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
- Interleave the 3 gzip compressed FASTQ files into one interleaved uncompressed FASTQ file on stdout
pheniqs mux --input BDGGG_s01.fastq.gz --input
BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output
"/dev/stdout?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 2:N:0:
GGACTCCT
+
B@CCCFC<
@M02455:162:000000000-BDGGG:1:1101:10000:10630 3:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
-i/--input
can show up multiple times on the command line. Notice that the order matters…pheniqs mux --input BDGGG_s01.fastq.gz --input BDGGG_s03.fastq.gz -i BDGGG_s02.fastq.gz --output "/dev/stdout?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 2:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 3:N:0:
GGACTCCT
+
B@CCCFC<
- Interleave the 3 gzip compressed FASTQ files into one interleaved SAM output on stdout
pheniqs mux --input BDGGG_s01.fastq.gz --input BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output "/dev/stdout?format=sam"|less
@HD VN:1.0 SO:unknown GO:query
@RG ID:undetermined PU:undetermined
@PG ID:pheniqs PN:pheniqs CL:pheniqs mux --input BDGGG_s01.fastq.gz --input BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output /dev/stdout?format=sam VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630 77 * 0 0 * * 0 0 CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG FI:i:1 TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630 13 * 0 0 * * 0 0 GGACTCCT B@CCCFC< FI:i:2 TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630 141 * 0 0 * * 0 0 GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG FI:i:3 TC:i:3
- Ok, now lets interleave the 3 gzip compressed FASTQ files into one interleaved CRAM formatted file
pheniqs mux -i BDGGG_s01.fastq.gz -i BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output BDGGG.cram
When done, Pheniqs will give you a JSON formatted report on stderr. The report will provide statistics on both input and output as well as separate statistics for both overall reads and reads passing quality control (Pass Filter or PF reads). The documentation on the website describes what each variable means. You can also redirect the report to a file with -R/–report. Since no sample classification was requested all reads were directed to the “unclassified” read group.
{
"incoming": {
"count": 2500,
"pf count": 2430,
"pf fraction": 0.972
},
"sample": {
"classified count": 0,
"classified fraction": 0.0,
"classified pf fraction": 0.0,
"count": 2500,
"index": 0,
"pf classified count": 0,
"pf classified fraction": 0.0,
"pf count": 2430,
"pf fraction": 0.972,
"unclassified": {
"ID": "undetermined",
"PU": "undetermined",
"count": 2500,
"index": 0,
"pf count": 2430,
"pf fraction": 0.972,
"pf pooled fraction": 1.0,
"pooled fraction": 1.0
}
}
}
To inspect your CRAM file you can, again, use Pheniqs. Essentially this is a format conversion from CRAM to SAM… CRAM is the state-of-the-art compressed version of BAM. It is basically a super compressed SAM file. If you want to archive your data this can save you a LOT of space.pheniqs mux -i BDGGG.cram|less
@HD VN:1.0 SO:unknown GO:query
@RG ID:undetermined PU:undetermined
@PG ID:pheniqs PN:pheniqs CL:pheniqs mux -i BDGGG.cram VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630 76 * 0 0 * * 0 0 CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
M02455:162:000000000-BDGGG:1:1101:10000:10630 76 * 0 0 * * 0 0 GGACTCCT B@CCCFC<
M02455:162:000000000-BDGGG:1:1101:10000:10630 76 * 0 0 * * 0 0 GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
Or in an interleaved FASTQpheniqs mux -i BDGGG.cram -o "-?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GGACTCCT
+
B@CCCFC<
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
Oh wait! This is probably not what you expected?! Pheniqs thinks every one of the records in your file is a new read, when in fact every three records are 3 segments of the same read. To give you absolute control, Pheniqs by default, expects you to explicitly describe the interleaving nature of the input file. You can use the -s/–sense-input flag so ask Pheniqs to try and guess the interleaving nature of the file.pheniqs mux -si BDGGG.cram|less
@HD VN:1.0 SO:unknown GO:query
@RG ID:undetermined PU:undetermined
@PG ID:pheniqs PN:pheniqs CL:pheniqs mux -si BDGGG.cram VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630 77 * 0 0 * * 0 0 CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG FI:i:1 TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630 13 * 0 0 * * 0 0 GGACTCCT B@CCCFC< FI:i:2 TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630 141 * 0 0 * * 0 0 GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG FI:i:3 TC:i:3
Om this makes more sense… Here it is in interleaved FASTQpheniqs mux -si BDGGG.cram -o "-?format=fastq"|less
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 2:N:0:
GGACTCCT
+
B@CCCFC<
@M02455:162:000000000-BDGGG:1:1101:10000:10630 3:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
Here are the sizes of the files in the different formats.
size | type |
---|---|
959K | fastq |
177K | fastq.gz |
1.1M | sam |
177K | bam |
113K | cram |
Read manipulation
Pheniqs can also parse and manipulate the content of the read. It uses a syntax similar to python array slicing to tokenize every read segment and assemble new segments from the tokens. The syntax for manipulating the read structure is too complex to provide on a command line, you will need a JSON config file.Let’s start with a simple theoretical example. Suppose the first 8 nucleotides on the first segment are actually a cellular barcode and you want to separate them into a different segment. We create a configuration file that describes the layout:
{
"template": {
"transform": {
"token": [ "0:8:", "1:0:", "0:0:8", "2:0:" ]
}
}
}
pheniqs mux --config manipulate.json -si BDGGG.cram |less
M02455:162:000000000-BDGGG:1:1101:10000:10630 77 * 0 0 * * 0 0 TAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA FGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG FI:i:1 TC:i:4
M02455:162:000000000-BDGGG:1:1101:10000:10630 13 * 0 0 * * 0 0 GGACTCCT B@CCCFC< FI:i:2 TC:i:4
M02455:162:000000000-BDGGG:1:1101:10000:10630 13 * 0 0 * * 0 0 CTAAGAAA CCCCCGGG FI:i:3 TC:i:4
M02455:162:000000000-BDGGG:1:1101:10000:10630 141 * 0 0 * * 0 0 GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG FI:i:4 TC:i:4
We used the cram file as input since its faster to read and its also easier to feed Pheniqs with an interleaved file.Since Pheniqs follows the python array slicing syntax it uses a zero based coordinate system. That means counters start from zero and are inclusive of start coordinate and exclusive of end coordinate. It might sound complicated but it actually makes arithmatics very easy. for instance the first 3 nucleotides on the first read segment are “0:0:3”.
A token is made of 3 colon separated integers. The first is the mandatory zero based input segment index. The second is an inclusive start coordinate and defaults to 0 if omitted. The third is an exclusive end coordinate. If the end coordinate is omitted the token spans to the end of the segment. start coordinate and end coordinate can take positive or negative values to access the segment from either the 5’ (left) or 3’ (right) end and mimic the python array slicing syntax. The two colons are always mandatory.
The template configuration element controls what the output read will look like. In this case we asked it to have 4 segments. The first output segment is “0:8:”, or cycles 8 to end of input segment 0. Pheniqs can validate your config file and describe, in words, what the conversion from input to output will be. here is the description given by
pheniqs mux --config manipulate.json -si BDGGG.cram --validate
Output transform
Output segment cardinality 4
Token No.0
Length variable
Pattern 0:8:
Description cycles 8 to end of input segment 0
Token No.1
Length variable
Pattern 1::
Description cycles 0 to end of input segment 1
Token No.2
Length 8
Pattern 0::8
Description cycles 0 to 8 of input segment 0
Token No.3
Length variable
Pattern 2::
Description cycles 0 to end of input segment 2
Assembly instruction
Append token 0 of input segment 0 to output segment 0
Append token 1 of input segment 1 to output segment 1
Append token 2 of input segment 0 to output segment 2
Append token 3 of input segment 2 to output segment 3
Classifying sample barcodes
Now assume that the 8 nucleotides on the second out of three segments in the input read is actually the sample barcode. That would be the standard Illumina protocol for a single indexed pair-end run. We want to classify the reads by that barcode. If sequencers never made mistakes and reads were perfect that would be a very easy task since you would just need to compare two strings with 8 characters. In reality sequencers make mistakes, between 1 in 1000 and 1 in 10000 on most Illumina sequencers.Most tools that classify barcodes ignore the quality score the sequencer provides and simply look for either a perfect string match or at most 1 mismatch, if the barcodes in your set are different enough from each other to tolerate the mismatch. This method is called minimum distance decoding, abbreviated MDD, and Pheniqs can do that if you insist. You just set the decoder property
"algorithm": "mdd"
.But Pheniqs can do something much smarter, it can compute the posterior probability of the correct barcode being decoded given the list of possible barcodes (see the paper for all the mathematical nitty gritty) and make decisions based on that. It can even report that probability for every read in a standard SAM auxiliary field. Think of it as a quality score for decoding the barcode. Except its not just a quality score, its an actual probability! if you had two, independent, barcodes (for instance a sample and cellular barcode) the probability that the entire classification is correct is simply the product of the two! We call this a Phred Adjusted Maximum Likelihood Decoder, abbreviated PAMLD.
PAMLD has much less false negatives, a LOT less than MDD. You can expect it to recover a lot of the reads that MDD is missing. How many depends on how confident you want Pheniqs to be, which you specify in the confidence threshold parameter. But PAMLD does another cool thing, it filters noise. It can tell if a read is more likely to be noise than an actual read. For instance a PhiX read that you spiked into the mixture with your pooled libraries or some human DNA that found its way into your sample (how did that get here?!). Benchmarks on simulated data tell us that it is VERY good at picking up noise.
Let’s look at a config file for decoding the sample barcode.
{
"input": [ "BDGGG.cram", "BDGGG.cram", "BDGGG.cram" ],
"template": {
"transform": {
"token": [
"0::",
"2::"
]
}
},
"sample": {
"algorithm": "pamld",
"confidence threshold": 0.99,
"noise": 0.01,
"transform": {
"token": [ "1::8" ]
},
"codec": {
"@AGGCAGAA": {
"LB": "trinidad 5",
"barcode": [ "AGGCAGAA" ],
"concentration": 1
},
"@CGTACTAG": {
"LB": "trinidad 4",
"barcode": [ "CGTACTAG" ],
"concentration": 1
},
"@GGACTCCT": {
"LB": "trinidad 9",
"barcode": [ "GGACTCCT" ],
"concentration": 1
},
"@TAAGGCGA": {
"LB": "trinidad 1",
"barcode": [ "TAAGGCGA" ],
"concentration": 1
},
"@TCCTGAGC": {
"LB": "trinidad 8",
"barcode": [ "TCCTGAGC" ],
"concentration": 1
}
}
}
}
Since we don’t want the sample barcode to be part of the output read we skip that segment in the template transform, we only want the first and last segment in the output.Next we declare a sample instruction in the config file. The transform in the sample instruction tells Pheniqs the barcode is on the first 8 nucleotides of the second input segment. The codec section is where you specify details on the barcodes you expect to find in your data. You naturally specify the barcode sequence, or multiple sequences if for instance you use dual indexing, but really you can have any number of segments in your barcodes, as long as their length and structure matches what you specified in the transform. Pheniqs has good error handling, if your config file has an error it will tell you exactly what went wrong.
For instance if in the first barcode you only specified 7 nucleotides:
"@AGGCAGAA": {
"LB": "trinidad 5",
"barcode": [ "AGGCAGA" ],
"concentration": 1
},
Configuration error : sample decoder : expected 8 but found 7 nucleotides in segment 0 of barcode @AGGCAGAA
or if you copy/pasted your barcodes wrong and had the first one specified twice:
"@AGGCAGAA": {
"LB": "trinidad 5",
"barcode": [ "CGTACTAG" ],
"concentration": 1
},
"@CGTACTAG": {
"LB": "trinidad 4",
"barcode": [ "CGTACTAG" ],
"concentration": 1
},
Configuration error : sample decoder : duplicate barcode sequence CGTACTAG
We specified the input in the config file instead of the command line, so we don’t have to keep typing it. We also specified how to parse the input explicitly, instead of using the –sense-input, by specifying the interleaved input file 3 times, telling Pheniqs to pull three records for every read, or that it should expect to find 3 consecutive segments for every read in the file. being able to explicitly specify how to parse the input allows you to deal with malformed files (lets face it, bioinformatics isn’t perfect).Notice we specified a 0.99 confidence threshold, so any barcode with a posterior decoding probability lower than that will be marked as failing quality control and moved to the unclassified (or undetermined) group. confidence threshold defaults to 0.95 but it represents a tradeoff between false positives and false negatives and the ideal value depends on your data and your barcode set, and frankly, what you want. We found 0.95 to be a fair tradeoff but if you prefer to loose more reads but be super confident about them you can set higher values and if you feel that your downstream analysis will be able to detect false positives and really want to recover as many as possible reads you can set lower values.
PAMLD’s Bayesian model depends on a set of prior probabilities, basically how much noise (the noise variable) and how much of each sample (the concentration variable in each barcode) you expect to find in your data. Noise defaults to 0.01, which is 1%, which is the amount of PhiX textbook Illumina runs have spiked in. But if you are doing anything other than a vanilla Illumina protocol you probably have more. if you leave out the concentration they all default to 1, essentially forming a uniform distribution. Pheniqs will normalize what ever you specify in concentration so that they all sum up to 1 – noise. For instance if you specify 1 in all your barcodes and 0.01 in noise and you had 5 samples they will each be assumed to have a 0.99 / 5 = 0.198 prior. What we have in the above config file could have just beed left out since its the default.
If you have a good guess you can specify your priors in the config. We will see later that Pheniqs can give you a pretty good estimate for them.
First lets see how the output looks
pheniqs mux --config decode.json |less
@HD VN:1.0 SO:unknown GO:query
@RG ID:undetermined PU:undetermined
@RG ID:AGGCAGAA BC:AGGCAGAA LB:trinidad 5 PU:AGGCAGAA
@RG ID:CGTACTAG BC:CGTACTAG LB:trinidad 4 PU:CGTACTAG
@RG ID:GGACTCCT BC:GGACTCCT LB:trinidad 9 PU:GGACTCCT
@RG ID:TAAGGCGA BC:TAAGGCGA LB:trinidad 1 PU:TAAGGCGA
@RG ID:TCCTGAGC BC:TCCTGAGC LB:trinidad 8 PU:TCCTGAGC
@PG ID:pheniqs PN:pheniqs CL:pheniqs mux --config decode.json VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630 77 * 0 0 * * 0 0 CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG RG:Z:GGACTCCT BC:Z:GGACTCCT QT:Z:B@CCCFC< XB:f:1.0616e-06
M02455:162:000000000-BDGGG:1:1101:10000:10630 141 * 0 0 * * 0 0 GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG RG:Z:GGACTCCT BC:Z:GGACTCCT QT:Z:B@CCCFC< XB:f:1.0616e-06
M02455:162:000000000-BDGGG:1:1101:10005:7907 589 * 0 0 * * 0 0 CCGTTTGAATGTTGACGGGATGAACATAATAAGCAATGACGGCAGCAATAA CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGG RG:Z:undetermined BC:Z:ATGTTTAT QT:Z:@----6,,
M02455:162:000000000-BDGGG:1:1101:10005:7907 653 * 0 0 * * 0 0 GAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCG CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGG RG:Z:undetermined BC:Z:ATGTTTAT QT:Z:@----6,,
Pheniqs has declared a read group for every sample barcode and one for the undetermined. The BC and QT tags hold the values for the uncorrected sample barcode and the quality scores for it. RG is a reference to the Read Group ID as declared in the header. Those are all standard tags, you will find them in the SAM specifications. XB is the special Pheniqs sauce, it is the probability that the sample barcode decoding was incorrect, the error probability. In the second read (remember, this is an interleaved file, every two records are a read. You can see they have the same read ID) decoding failed. BC and QT are there but RG points to undetermined and there is no XB. If you will also notice the flag field is different since the bit marking this read as failing QC is on.You can add Read Group metadata to your config file and it will end up where you expect it to. You already saw an example of that with the LB tag. for instance we add some global ones so they are added to all Read Groups.
{
"CN": "CGSB",
"DT": "2018-02-25T07:00:00+00:00",
"PI": "300",
"PL": "ILLUMINA",
"PM": "miseq",
"SM": "trinidad",
"input": [ "BDGGG.cram", "BDGGG.cram", "BDGGG.cram" ],
.
.
.
}
and we get a richer header
@HD VN:1.0 SO:unknown GO:query
@RG ID:undetermined CN:CGSB DT:2018-02-25T07:00:00+00:00 PI:300 PL:ILLUMINA PM:miseq PU:undetermined SM:trinidad
@RG ID:AGGCAGAA BC:AGGCAGAA CN:CGSB DT:2018-02-25T07:00:00+00:00 LB:trinidad 5 PI:300 PL:ILLUMINA PM:miseq PU:AGGCAGAA SM:trinidad
@RG ID:CGTACTAG BC:CGTACTAG CN:CGSB DT:2018-02-25T07:00:00+00:00 LB:trinidad 4 PI:300 PL:ILLUMINA PM:miseq PU:CGTACTAG SM:trinidad
@RG ID:GGACTCCT BC:GGACTCCT CN:CGSB DT:2018-02-25T07:00:00+00:00 LB:trinidad 9 PI:300 PL:ILLUMINA PM:miseq PU:GGACTCCT SM:trinidad
@RG ID:TAAGGCGA BC:TAAGGCGA CN:CGSB DT:2018-02-25T07:00:00+00:00 LB:trinidad 1 PI:300 PL:ILLUMINA PM:miseq PU:TAAGGCGA SM:trinidad
@RG ID:TCCTGAGC BC:TCCTGAGC CN:CGSB DT:2018-02-25T07:00:00+00:00 LB:trinidad 8 PI:300 PL:ILLUMINA PM:miseq PU:TCCTGAGC SM:trinidad
@PG ID:pheniqs PN:pheniqs CL:pheniqs mux --config decode.json VN:2.1.0
Estimating the priors
Suppose you have no idea what the priors are. You THINK you add the same amount of each pooled library but who know what actually happened. You also have no idea how much noise is there…You make an initial Pheniqs run with your best guess, or with the defaults, and specify the
--prior
command line argument and provide a path to a new config file. Pheniqs will use the run to estimate the relative proportions of barcodes and noise in your data and write a new config file that contains the estimated priors. Since you don’t really want the actual output reads from this run you can save some time by using another nifty Pheniqs feature: you specify the output to be /dev/null
basically telling Pheniqs you don’t want the output. That will save quite a bit of time since Pheniqs won’t waste time formatting the output or writing it anywhere.pheniqs mux --config decode.json --prior decode_with_prior.json --output /dev/null
when its done running you will find a new config file at decode_with_prior.json
.
{
"CN": "CGSB",
"DT": "2018-02-25T07:00:00+00:00",
"PI": "300",
"PL": "ILLUMINA",
"PM": "miseq",
"SM": "trinidad",
"input": [
"BDGGG.cram",
"BDGGG.cram",
"BDGGG.cram"
],
"output": [
"/dev/null"
],
"prior adjusted job url": "decode_with_prior.json?format=json",
"sample": {
"algorithm": "pamld",
"codec": {
"@AGGCAGAA": {
"LB": "trinidad 5",
"barcode": [
"AGGCAGAA"
],
"concentration": 0.186575935178905
},
"@CGTACTAG": {
"LB": "trinidad 4",
"barcode": [
"CGTACTAG"
],
"concentration": 0.198520089969607
},
"@GGACTCCT": {
"LB": "trinidad 9",
"barcode": [
"GGACTCCT"
],
"concentration": 0.211699846980038
},
"@TAAGGCGA": {
"LB": "trinidad 1",
"barcode": [
"TAAGGCGA"
],
"concentration": 0.219937195111557
},
"@TCCTGAGC": {
"LB": "trinidad 8",
"barcode": [
"TCCTGAGC"
],
"concentration": 0.168041901882987
}
},
"confidence threshold": 0.99,
"noise": 0.015225030876904,
"transform": {
"token": [
"1::8"
]
}
},
"template": {
"transform": {
"token": [
"0::",
"2::"
]
}
}
}
Notice the concentration
and noise
parameters have been adjusted to reflect what Pheniqs guessed from the initial run. The actual methodology for this prior estimate is described in the paper and in the manual. Notice the method is recursive since it uses your initial values as the base for the estimation. you can potentially run it a second time and get an even more accurate estimate but you also run the risk of over training your priors, but this is a different discussion… our benchmarks on synthetic data showed that a single estimation that started from a uniform distribution on data that was VERY non uniform still yielded a significant improvement in both precision and recall, and was just shy of using the real (and obviously unknown when your data is real and not simulated) priors.one small minor detail is that since you executed Pheniqs with a null output your new config also specifies a null output. you can either edit the config or override it on the command line when you use this new config.
pheniqs mux --config decode_with_prior.json -o -|less
@HD VN:1.0 SO:unknown GO:query
@RG ID:undetermined CN:CGSB DT:2018-02-25T07:00:00+00:00 PI:300 PL:ILLUMINA PM:miseq PU:undetermined SM:trinidad
@RG ID:AGGCAGAA BC:AGGCAGAA CN:CGSB DT:2018-02-25T07:00:00+00:00 LB:trinidad 5 PI:300 PL:ILLUMINA PM:miseq PU:AGGCAGAA SM:trinidad
@RG ID:CGTACTAG BC:CGTACTAG CN:CGSB DT:2018-02-25T07:00:00+00:00 LB:trinidad 4 PI:300 PL:ILLUMINA PM:miseq PU:CGTACTAG SM:trinidad
@RG ID:GGACTCCT BC:GGACTCCT CN:CGSB DT:2018-02-25T07:00:00+00:00 LB:trinidad 9 PI:300 PL:ILLUMINA PM:miseq PU:GGACTCCT SM:trinidad
@RG ID:TAAGGCGA BC:TAAGGCGA CN:CGSB DT:2018-02-25T07:00:00+00:00 LB:trinidad 1 PI:300 PL:ILLUMINA PM:miseq PU:TAAGGCGA SM:trinidad
@RG ID:TCCTGAGC BC:TCCTGAGC CN:CGSB DT:2018-02-25T07:00:00+00:00 LB:trinidad 8 PI:300 PL:ILLUMINA PM:miseq PU:TCCTGAGC SM:trinidad
@PG ID:pheniqs PN:pheniqs CL:pheniqs mux --config decode_with_prior.json -o - VN:2.1.0-6-gb006bb7a4990e86e62e4f0085e405662e88daef6
M02455:162:000000000-BDGGG:1:1101:10000:10630 77 * 0 0 * * 0 0 CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG RG:Z:GGACTCCT BC:Z:GGACTCCT QT:Z:B@CCCFC< XB:f:1.10298e-06
M02455:162:000000000-BDGGG:1:1101:10000:10630 141 * 0 0 * * 0 0 GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG RG:Z:GGACTCCT BC:Z:GGACTCCT QT:Z:B@CCCFC< XB:f:1.10298e-06
Splitting output
Interleaved files have many advantages: single input and single output allow you to use standard POSIX streams and avoid storing intermediate files. Its faster to read and write. It keeps experimental directory structures simpler and makes archives easier to browse. But sometimes you need to split your reads over multiple files, either by read segment or by barcode. Pheniqs can do that.In the same fashion that the input directive allowed you to dictate multiple files to pull read segments from the output directive allows you to split the segments over multiple files.
Read segment Splitting
If your output read has multiple segments you can either specify just one output, in which case all segments will be interleaved into that one output. Or you can specify as many outputs as you have output segments and the segments will be written to the seperate files.either on the command line
pheniqs mux --config decode.json --prior decode_with_prior.json --output first_segment.fastq.gz -o second_segment.bam
or in the config file
{
"output": [
"first_segment.fastq.gz",
"second_segment.bam"
]
}
Each output file is evaluated independently so they don’t have to have the same format!
Barcode Splitting
You can also split reads into multiple files by barcodes. Either sample, cellular or molecular barcodes as long as you only split by one decoder. To do that you specify the output directive in the config section for each barcode.{
"CN": "CGSB",
"DT": "2018-02-25T07:00:00+00:00",
"PI": "300",
"PL": "ILLUMINA",
"PM": "miseq",
"SM": "trinidad",
"input": [
"BDGGG.cram",
"BDGGG.cram",
"BDGGG.cram"
],
"output": [
"BDGGG_default.bam"
],
"prior adjusted job url": "decode_with_prior.json?format=json",
"sample": {
"algorithm": "pamld",
"codec": {
"@AGGCAGAA": {
"LB": "trinidad 5",
"barcode": [
"AGGCAGAA"
],
"concentration": 0.186575935178905,
"output": [
"BDGGG_AGGCAGAA.bam"
]
},
"@CGTACTAG": {
"LB": "trinidad 4",
"barcode": [
"CGTACTAG"
],
"concentration": 0.198520089969607,
"output": [
"BDGGG_CGTACTAG_s01.fastq.gz",
"BDGGG_CGTACTAG_s02.fastq"
]
},
"@GGACTCCT": {
"LB": "trinidad 9",
"barcode": [
"GGACTCCT"
],
"concentration": 0.211699846980038
},
"@TAAGGCGA": {
"LB": "trinidad 1",
"barcode": [
"TAAGGCGA"
],
"concentration": 0.219937195111557
},
"@TCCTGAGC": {
"LB": "trinidad 8",
"barcode": [
"TCCTGAGC"
],
"concentration": 0.168041901882987
}
},
"confidence threshold": 0.99,
"noise": 0.015225030876904,
"transform": {
"token": [
"1::8"
]
}
},
"template": {
"transform": {
"token": [
"0::",
"2::"
]
}
}
}
You can still set a global output and data from any barcode that has
not declared its own output directive will be routed to the global
output. You can have different layouts for each barcode. When we said
Pheniqs is flexible, we meant it 😉