Pheniqs

Pheniqs (pronounced “phoenix”) is a flexible, high-performance command-line tool for manipulating and classifying high-throughput sequencing reads. Written in multithreaded C++11 with direct HTSlib integration, it reads and writes FASTQ, SAM, BAM, and CRAM, converting between formats on the fly and handling interleaved or split segment layouts in any combination. Even without any barcode processing, Pheniqs serves as a fast general-purpose utility for interleaving split read segments into a single compressed container, restructuring read layouts, and piping reformatted data through POSIX streams to tools that require a specific input format. A tokenization syntax inspired by Python array slicing lets users extract, rearrange, and reverse-complement arbitrary subsequences from any read segment and assemble new output reads from the pieces. All of this is controlled by a JSON configuration file that supports inheritance, imports, and reusable definitions, or by simple command-line arguments for trivial operations.
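To make the slicing idea concrete, here is a minimal Python sketch of how a token might select a subsequence from a read segment. The `segment:start:end` token spelling, the function names, and the example reads are hypothetical and chosen for illustration; they are not taken from the Pheniqs configuration grammar, which should be checked in the project documentation.

```python
# Illustrative only: a hypothetical "segment:start:end" token,
# not the exact Pheniqs token grammar.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    """Complement every base, then reverse the result."""
    return sequence.translate(COMPLEMENT)[::-1]


def extract(segments, token):
    """Slice one read segment with a 'segment:start:end' token.

    An empty start or end behaves like an omitted Python slice bound,
    and negative offsets count from the end of the segment.
    """
    index, start, end = token.split(":")
    lower = int(start) if start else None
    upper = int(end) if end else None
    return segments[int(index)][lower:upper]


# Three made-up read segments belonging to a single read.
segments = ["ACGTACGTAA", "GGATCC", "TTTTGGGGCC"]

barcode = extract(segments, "1::6")   # first six bases of segment 1
prefix = extract(segments, "0::4")    # first four bases of segment 0
tail = extract(segments, "2:-6:")     # last six bases of segment 2

# Assemble a new output read from the extracted pieces.
output_read = prefix + reverse_complement(tail)
```

The point of the sketch is that extraction is just a bounded slice per segment, so rearranging a read amounts to concatenating tokens in a new order, optionally reverse-complementing a piece first.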

When barcode classification is needed, Pheniqs natively supports sample and cellular barcodes at arbitrary positions along the read, as well as user-defined barcode types (split-pool tags, antibody tags, spatial barcodes, etc.), all without pre- or post-processing. Its core decoder, the Phred-adjusted maximum likelihood decoder (PAMLD), computes the full Bayesian posterior probability of each barcode assignment by combining per-base quality scores with prior distributions over barcode frequencies and an explicit noise model. A noise filter rejects reads whose best-match likelihood is no better than random, and a tunable confidence threshold (defaulting to 0.95) flags low-confidence assignments as QC-fail while preserving the classification for downstream reconsideration. Priors can be user-supplied, left uniform, or estimated from the data in a two-pass approach. Decoded sequences, quality scores, and per-read error probabilities are written to standardized SAM auxiliary fields, and for combinatorial barcodes the overall confidence is the product of the individual posteriors. On semi-synthetic benchmarks, PAMLD consistently outperforms both minimum distance decoding and simpler maximum likelihood methods, with the largest gains for low-abundance barcodes and high error rates.
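The decoding steps described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Pheniqs implementation: it assumes a Phred error probability of 10^(-Q/10) per base, spreads each error evenly over the three other nucleotides, and models noise as a uniformly random sequence. The function names, the `noise_prior` value, and the example barcodes are hypothetical.

```python
import math


def base_error(quality):
    """Phred quality score to error probability."""
    return 10 ** (-quality / 10)


def likelihood(observed, qualities, barcode):
    """P(observed | barcode), assuming independent per-base errors."""
    probability = 1.0
    for base, quality, expected in zip(observed, qualities, barcode):
        error = base_error(quality)
        probability *= (1 - error) if base == expected else error / 3
    return probability


def decode(observed, qualities, priors, noise_prior=0.01, threshold=0.95):
    """Return (barcode, posterior, passes_threshold) for one read.

    priors maps each barcode to its prior; priors plus noise_prior
    are assumed to sum to 1.
    """
    random_likelihood = 0.25 ** len(observed)  # assumed noise model
    likes = {b: likelihood(observed, qualities, b) for b in priors}
    best = max(likes, key=lambda b: likes[b] * priors[b])
    # Noise filter: reject reads no better explained than random sequence.
    if likes[best] <= random_likelihood:
        return None, 0.0, False
    evidence = noise_prior * random_likelihood
    evidence += sum(likes[b] * priors[b] for b in priors)
    posterior = likes[best] * priors[best] / evidence
    # Low-confidence reads keep their classification but are flagged.
    return best, posterior, posterior >= threshold


def combined_confidence(posteriors):
    """Overall confidence of a combinatorial barcode."""
    return math.prod(posteriors)


priors = {"ACGT": 0.495, "TGCA": 0.495}
call, posterior, passed = decode("ACGA", [30, 30, 30, 10], priors)
rejected = decode("GGGG", [30, 30, 30, 30], priors)
```

In the first call the single mismatch falls on a low-quality base, so the read is still assigned to `ACGT` with high posterior; in the second, neither barcode explains the read better than random sequence, so the noise filter rejects it.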

A consumer/producer threading model with double-buffered I/O allows performance to scale nearly linearly with core count, processing billions of reads per hour on modern hardware while using minimal memory. Pheniqs is available via Bioconda or source compilation and is free for academic use under the NYU non-commercial research license.
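The threading model mentioned above can be pictured with a small Python sketch in which a reader thread fills a bounded buffer while worker threads drain it. This only illustrates the structure: Pheniqs itself is written in C++, Python threads do not give the same scaling, and the function name, the `buffer_depth` parameter, and the use of a two-slot queue to stand in for double buffering are all assumptions made here.

```python
import queue
import threading


def run_pipeline(records, transform, workers=2, buffer_depth=2):
    """Apply transform to records using one producer and several consumers."""
    # A bounded queue lets the producer refill one slot while
    # consumers are still working on the other.
    inbound = queue.Queue(maxsize=buffer_depth)
    outbound = queue.Queue()

    def producer():
        for index, record in enumerate(records):
            inbound.put((index, record))
        # One sentinel per consumer signals the end of input.
        for _ in range(workers):
            inbound.put(None)

    def consumer():
        while True:
            item = inbound.get()
            if item is None:
                break
            index, record = item
            outbound.put((index, transform(record)))

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Collect results and restore the original input order.
    results = []
    while not outbound.empty():
        results.append(outbound.get())
    return [value for _, value in sorted(results)]


processed = run_pipeline(["acgt", "ttaa", "ggcc"], str.upper)
```

Bounding the inbound queue keeps memory use flat regardless of input size, because the producer blocks whenever the consumers fall behind.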