Overview

Scope and Features

pRESTO performs all stages of raw sequence processing prior to alignment against reference germline sequences. The toolkit is intended to be easy to use, but some familiarity with command-line applications is expected. Rather than providing a fixed solution to a small number of common workflows, we have designed pRESTO to be as flexible as possible. This design philosophy makes pRESTO suitable for many existing protocols and adaptable to future technologies, but requires users to construct a sequence of commands and options specific to their experimental protocol.

pRESTO is composed of a set of standalone tools to perform specific tasks, often with a series of subcommands providing different behaviors. A brief description of each tool is shown in the table below.

Tool	Subcommand	Description
AlignSets.py		Performs multiple alignment of sets of sequences sharing the same annotation
	muscle	Uses the program MUSCLE to align reads
	offset	Uses a table of primer alignments to align the 5’ region
	table	Creates a table of primer alignments for the offset subcommand
AssemblePairs.py		Assembles paired-end reads into a complete sequence
	align	Assembles paired-end reads by aligning the sequence ends
	join	Concatenates paired-end reads with intervening gaps
	reference	Assembles paired-end reads using V-segment references
	sequential	Attempts alignment assembly followed by reference assembly
BuildConsensus.py		Constructs UMI consensus sequences
ClusterSets.py		Clusters read groups
	all	Clusters all sequences regardless of annotation
	barcode	Clusters reads by clustering barcode sequences
	set	Clusters reads by sequence data within barcode groups
CollapseSeq.py		Removes duplicate sequences
ConvertHeaders.py		Converts sequence headers to the pRESTO format
	454	Converts Roche 454 sequence headers
	genbank	Converts NCBI GenBank and RefSeq sequence headers
	generic	Converts sequence headers with an unknown annotation system
	illumina	Converts Illumina sequence headers
	imgt	Converts sequence headers output by IMGT/GENE-DB
	migec	Converts sequence headers output by MIGEC
	sra	Converts NCBI SRA or EMBL-EBI ENA sequence headers
EstimateError.py		Estimates error rates for UMI data
	barcode	Calculates pairwise distance metrics of barcode sequences
	set	Estimates error statistics within annotation sets
FilterSeq.py		Removes or modifies low quality reads
	length	Removes sequences under a defined length
	maskqual	Masks low Phred quality score positions with Ns
	missing	Removes sequences with a high number of Ns
	quality	Removes sequences with low Phred quality scores
	repeats	Removes sequences with long repeats of a single nucleotide
	trimqual	Trims sequences to segments with high Phred quality scores
MaskPrimers.py		Identifies and removes primer regions, MIDs and UMI barcodes
	align	Matches primers by local alignment and reorients sequences
	extract	Removes and annotates a fixed sequence region
	score	Matches primers at a fixed user-defined start position
PairSeq.py		Sorts paired-end reads and copies annotations between them
ParseHeaders.py		Manipulates sequence annotations
	add	Adds a field and value annotation pair to all reads
	collapse	Compresses a set of annotation fields into a single field
	copy	Copies values between annotation fields
	delete	Deletes an annotation from all reads
	expand	Expands a field with multiple values into separate annotations
	merge	Merges multiple annotation fields into a single field
	rename	Renames annotation fields
	table	Outputs sequence annotations as a data table
ParseLog.py		Converts the log output of pRESTO scripts into data tables
SplitSeq.py		Performs conversion, sorting, and subsetting of sequence files
	count	Splits files into smaller files
	group	Splits files based on numerical or categorical annotation
	sample	Randomly samples sequences from a file
	samplepair	Randomly samples paired-end reads from two files
	select	Filters sequences based on annotations
	sort	Sorts sequences based on annotations
UnifyHeaders.py		Unifies annotation fields based on a grouping scheme
	consensus	Reassigns fields to consensus values
	delete	Deletes sequences with differing field values

Input and Output

All tools take as input standard FASTA or FASTQ formatted files and output files in the same formats. This allows pRESTO to work seamlessly with other sequence processing tools that use either of these data formats; any steps within a pRESTO workflow can be exchanged for an alternate tool, if desired.

Each tool appends a specific suffix to its output files describing the step and output. For example, MaskPrimers will append _primers-pass to the output file containing successfully aligned sequences and _primers-fail to the file containing unaligned sequences.
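The naming convention can be sketched in Python as follows. This is an illustration only, not pRESTO's own code: the helper name output_name is hypothetical, and the sketch assumes the suffix is placed between the input file's base name and its extension.

```python
from pathlib import Path


def output_name(input_file, task, status):
    """Build an output file name by adding a _<task>-<status> suffix.

    Hypothetical helper for illustration; assumes the suffix goes
    between the base name and the file extension.
    """
    path = Path(input_file)
    # path.stem is the name without its final extension; path.suffix is that extension
    return f"{path.stem}_{task}-{status}{path.suffix}"


if __name__ == "__main__":
    print(output_name("reads.fastq", "primers", "pass"))
    print(output_name("reads.fastq", "primers", "fail"))
```

Under these assumptions, an input named reads.fastq would yield reads_primers-pass.fastq and reads_primers-fail.fastq, so the step that produced a file can be read directly from its name.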

Annotation Scheme

The majority of pRESTO tools manipulate and add sequence-specific annotations as part of their processing functions using the scheme shown below. Each annotation is delimited using a reserved character (| by default), with the annotation field name and values separated by a second reserved character (= by default), and each value within a field is separated by a third reserved character (, by default). These annotations follow the sequence identifier, which itself immediately follows the > (FASTA) or @ (FASTQ) symbol denoting the beginning of a new sequence entry. The sequence identifier is given the reserved field name ID. To mitigate potential analysis errors, each tool in pRESTO annotates sequences by appending values to existing annotation fields when they exist, and will not overwrite or delete annotations unless explicitly instructed to do so using the ParseHeaders tool. All reserved characters can be redefined using command-line options.

FASTA Annotation

>SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8
NNNNCCACGATTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTTTCTACCAAAA

FASTQ Annotation

@SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8
NNNNCCACGATTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTTTCTACCAAAA
+
!!!!nmoomllmlooj\Xlnngookkikloommononnoonnomnnlomononoojlmmkiklonooooooooomoo
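
Parsing Annotations (illustrative sketch)

The annotation scheme described above is simple enough to read with a few lines of Python. The following is a minimal standalone sketch, not pRESTO's own implementation: the function names parse_header, append_annotation, and format_header are hypothetical, and the code assumes the default reserved characters (|, =, and ,) unless others are passed in.

```python
def parse_header(header, delimiters=("|", "=", ",")):
    """Split a pRESTO-style header into a dict of field name -> list of values."""
    field_sep, value_sep, list_sep = delimiters
    header = header.strip()
    # Drop the leading FASTA (>) or FASTQ (@) record marker if present
    if header[:1] in (">", "@"):
        header = header[1:]
    parts = header.split(field_sep)
    # The sequence identifier is stored under the reserved field name ID
    fields = {"ID": [parts[0]]}
    for part in parts[1:]:
        name, _, value = part.partition(value_sep)
        # Repeated field names extend the existing value list
        fields.setdefault(name, []).extend(value.split(list_sep))
    return fields


def append_annotation(fields, name, value):
    """Append a value to a field, creating the field if needed (never overwrites)."""
    fields.setdefault(name, []).append(str(value))
    return fields


def format_header(fields, delimiters=("|", "=", ",")):
    """Rebuild the header string (without the > or @ marker) from parsed fields."""
    field_sep, value_sep, list_sep = delimiters
    annotations = [
        f"{name}{value_sep}{list_sep.join(values)}"
        for name, values in fields.items()
        if name != "ID"
    ]
    return field_sep.join([fields["ID"][0]] + annotations)


if __name__ == "__main__":
    fields = parse_header(">SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8")
    print(fields["PRIMER"])
    print(format_header(fields))
```

Storing every field as a list of values mirrors the append-rather-than-overwrite behavior described above: adding a value to an existing field lengthens its list instead of replacing what was there.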