Overview
Scope and Features
pRESTO performs all stages of raw sequence processing prior to alignment against reference germline sequences. The toolkit is intended to be easy to use, but some familiarity with commandline applications is expected. Rather than providing a fixed solution to a small number of common workflows, we have designed pRESTO to be as flexible as possible. This design philosophy makes pRESTO suitable for many existing protocols and adaptable to future technologies, but requires users to construct a sequence of commands and options specific to their experimental protocol.
pRESTO is composed of a set of standalone tools to perform specific tasks, often with a series of subcommands providing different behaviors. A brief description of each tool is shown in the table below.
| Tool | Subcommand | Description |
|---|---|---|
| AlignSets | | Performs multiple alignment of sets of sequences sharing the same annotation |
| | muscle | Uses the program MUSCLE to align reads |
| | offset | Uses a table of primer alignments to align the 5' region |
| | table | Creates a table of primer alignments for the offset subcommand |
| AssemblePairs | | Assembles paired-end reads into a complete sequence |
| | align | Assembles paired-end reads by aligning the sequence ends |
| | join | Concatenates paired-end reads with intervening gaps |
| | reference | Assembles paired-end reads using V-segment references |
| | sequential | Attempts alignment assembly followed by reference assembly |
| BuildConsensus | | Constructs UMI consensus sequences |
| ClusterSets | | Clusters read groups |
| | all | Clusters all sequences regardless of annotation |
| | barcode | Clusters reads by clustering barcode sequences |
| | set | Clusters reads by sequence data within barcode groups |
| CollapseSeq | | Removes duplicate sequences |
| ConvertHeaders | | Converts sequence headers to the pRESTO format |
| | 454 | Converts Roche 454 sequence headers |
| | genbank | Converts NCBI GenBank and RefSeq sequence headers |
| | generic | Converts sequence headers with an unknown annotation system |
| | illumina | Converts Illumina sequence headers |
| | imgt | Converts sequence headers output by IMGT/GENE-DB |
| | migec | Converts sequence headers output by MIGEC |
| | sra | Converts NCBI SRA or EMBL-EBI ENA sequence headers |
| EstimateError | | Estimates error rates for UMI data |
| | barcode | Calculates pairwise distance metrics of barcode sequences |
| | set | Estimates error statistics within annotation sets |
| FilterSeq | | Removes or modifies low quality reads |
| | length | Removes sequences under a defined length |
| | maskqual | Masks low Phred quality score positions with Ns |
| | missing | Removes sequences with a high number of Ns |
| | quality | Removes sequences with low Phred quality scores |
| | repeats | Removes sequences with long repeats of a single nucleotide |
| | trimqual | Trims sequences to segments with high Phred quality scores |
| MaskPrimers | | Identifies and removes primer regions, MIDs and UMI barcodes |
| | align | Matches primers by local alignment and reorients sequences |
| | extract | Removes and annotates a fixed sequence region |
| | score | Matches primers at a fixed user-defined start position |
| PairSeq | | Sorts paired-end reads and copies annotations between them |
| ParseHeaders | | Manipulates sequence annotations |
| | add | Adds a field and value annotation pair to all reads |
| | collapse | Compresses a set of annotation fields into a single field |
| | copy | Copies values between annotation fields |
| | delete | Deletes an annotation from all reads |
| | expand | Expands a field with multiple values into separate annotations |
| | merge | Merges multiple annotation fields into a single field |
| | rename | Renames annotation fields |
| | table | Outputs sequence annotations as a data table |
| ParseLog | | Converts the log output of pRESTO scripts into data tables |
| SplitSeq | | Performs conversion, sorting, and subsetting of sequence files |
| | count | Splits files into smaller files |
| | group | Splits files based on numerical or categorical annotation |
| | sample | Randomly samples sequences from a file |
| | samplepair | Randomly samples paired-end reads from two files |
| | select | Filters sequences based on annotations |
| | sort | Sorts sequences based on annotations |
| UnifyHeaders | | Unifies annotation fields based on grouping scheme |
| | consensus | Reassigns fields to consensus values |
| | delete | Deletes sequences with differing field values |
Input and Output
All tools take as input standard FASTA or FASTQ formatted files and output files in the same formats. This allows pRESTO to work seamlessly with other sequence processing tools that use either of these data formats; any steps within a pRESTO workflow can be exchanged for an alternate tool, if desired.
Each tool appends a specific suffix to its output files describing the step and output. For example, MaskPrimers will append _primers-pass to the output file containing successfully aligned sequences and _primers-fail to the file containing unaligned sequences.
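When chaining steps in a script, it can help to compute the expected output name of one step so it can be passed as input to the next. The sketch below is illustrative only: `suffixed_output` is a hypothetical helper, not part of pRESTO, and it assumes the suffix is placed between the input file's base name and its extension. The exact suffixes for each tool should be checked against the Commandline Usage documentation.

```python
from pathlib import Path


def suffixed_output(input_file: str, suffix: str) -> str:
    """Return the input file name with `suffix` inserted before the extension.

    Hypothetical helper. Assumes output names take the form
    <base name><suffix><extension>, e.g. reads_primers-pass.fastq.
    """
    path = Path(input_file)
    return str(path.with_name(f"{path.stem}{suffix}{path.suffix}"))


# Expected names of the two MaskPrimers outputs described above,
# given an input file called reads.fastq (an assumed file name).
pass_file = suffixed_output("reads.fastq", "_primers-pass")
fail_file = suffixed_output("reads.fastq", "_primers-fail")
print(pass_file)
print(fail_file)
```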
See also
Details regarding the suffixes used by pRESTO tools can be found in the Commandline Usage documentation for each tool.
Annotation Scheme
The majority of pRESTO tools manipulate and add sequence-specific annotations as part of their processing functions using the scheme shown below. Each annotation is delimited using a reserved character (| by default), with the annotation field name and values separated by a second reserved character (= by default), and each value within a field is separated by a third reserved character (, by default). These annotations follow the sequence identifier, which itself immediately follows the > (FASTA) or @ (FASTQ) symbol denoting the beginning of a new sequence entry. The sequence identifier is given the reserved field name ID. To mitigate potential analysis errors, each tool in pRESTO annotates sequences by appending values to existing annotation fields when they exist, and will not overwrite or delete annotations unless explicitly instructed to do so using the ParseHeaders tool. All reserved characters can be redefined using the command line options.
>SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8
NNNNCCACGATTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTTTCTACCAAAA
@SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8
NNNNCCACGATTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTTTCTACCAAAA
+
!!!!nmoomllmlooj\Xlnngookkikloommononnoonnomnnlomononoojlmmkiklonooooooooomoo
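The annotation scheme above can be read and written with a few lines of Python. The following is a minimal sketch, assuming the default reserved characters (|, = and ,). The function names `parse_header`, `append_value` and `format_header` are hypothetical and written for illustration; they are not pRESTO's own implementation, which may behave differently in edge cases.

```python
# Default reserved characters: annotation, field/value, and value-list delimiters.
DEFAULT_DELIMITER = ("|", "=", ",")


def parse_header(header, delimiter=DEFAULT_DELIMITER):
    """Split a FASTA/FASTQ header line into a dict of annotations.

    The sequence identifier is stored as a string under the reserved
    field name ID; every other field maps to a list of string values.
    """
    # Drop the leading > (FASTA) or @ (FASTQ) symbol if present.
    if header[:1] in (">", "@"):
        header = header[1:]
    fields = header.strip().split(delimiter[0])
    annotations = {"ID": fields[0]}
    for field in fields[1:]:
        name, _, value = field.partition(delimiter[1])
        annotations[name] = value.split(delimiter[2])
    return annotations


def append_value(annotations, name, value):
    """Append a value to a field instead of overwriting existing values."""
    annotations.setdefault(name, []).append(str(value))


def format_header(annotations, delimiter=DEFAULT_DELIMITER):
    """Rebuild the header text (without the leading > or @ symbol)."""
    parts = [annotations["ID"]]
    for name, values in annotations.items():
        if name != "ID":
            parts.append(f"{name}{delimiter[1]}{delimiter[2].join(values)}")
    return delimiter[0].join(parts)


header = ">SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8"
annotations = parse_header(header)
print(annotations["PRIMER"])

# Appending keeps the existing BARCODE value and adds a second one.
append_value(annotations, "BARCODE", "DAY14")
print(format_header(annotations))
```

Storing each field as a list mirrors the append-only behavior described above: adding a value to an existing field extends the list rather than replacing it.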
See also
Details regarding the annotations added by pRESTO tools can be found in the Commandline Usage documentation for each tool.
The ParseHeaders.py tool provides a number of options for manipulating annotations in the pRESTO format.
The ConvertHeaders.py tool allows you to convert several common annotation schemes into the pRESTO annotation format.