Miscellaneous Tasks¶
Importing data from SRA, ENA or GenBank into pRESTO¶
If you have download a data set from GenBank, SRA or ENA the format of the sequences headers are different from the raw Roche 454 and Illumina header format. As such, they may or may not be compatible with pRESTO, depending on how the headers have been modified by the sequence archive. The ConvertHeaders allow you to change incompatible header formats into the pRESTO format. For example, to convert from SRA or ENA headers the sra subcommand would be used:
ConvertHeaders.py sra -s reads.fastq
ConvertHeaders provides the following conversion subcommands:
Subcommand | Formats Converted |
---|---|
generic | Headers with an unknown annotation system |
454 | Roche 454 |
genbank | NCBI GenBank and RefSeq |
illumina | Illumina HiSeq or MiSeq |
imgt | IMGT/GENE-DB |
sra | NCBI SRA or EMBL-EBI ENA |
Reducing file size for submission to IMGT/HighV-QUEST¶
IMGT/HighV-QUEST currently limits the size of uploaded files to 500,000 sequences. To accomodate this limit, you can use the count subcommand of SplitSeq to divide your files into small pieces.
SplitSeq.py count -s reads.fastq -n 500000 --fasta
The -n 500000
argument sets the maximum number of
sequences in each file and the --fasta
tells the tool to output a FASTA, rather than FASTQ, formatted file.
Note
You can usually avoid the necessity of reducing file sizes by removing duplicate sequences first using the CollapseSeq tool.
Subsetting sequence files by annotation¶
The group subcommand of SplitSeq allows you to split one file
into multiple files based on the values in a sequence annotation. For example,
splitting one file with multiple SAMPLE
annotations into separate files
(one for each sample) would be accomplished by:
SplitSeq.py group -s reads.fastq -f SAMPLE
Which will create a set of files labelled SAMPLE-M1
and SAMPLE-M2
, if samples are
named M1
and M2
.
If you wanted to split based on a numeric value, rather than a set of categorical values,
then you would add the --num
argument. SplitSeq
would then create two files: one containing sequences with values less than the threshold
specified by the --num
argument and one file containing
sequences with values greater than or equal to the threshold:
SplitSeq.py group -s reads.fastq -f DUPCOUNT --num 2
Which will create two files with the labels atleast-2
and under-2
.
Random sampling from sequence files¶
The sample subcommand of SplitSeq may be used to generate a
random sample from a sequence file or set of pair-end files. The example below
will select a random sample of 1,000 sequences (-n 1000
)
which all contain the annotation SAMPLE=M1
(-f SAMPLE
and -u M1
):
SplitSeq.py sample -s reads.fastq -f SAMPLE -u M1 -n 1000
Performing an analogous sampling of Illumina paired-end reads would be accomplished using the samplepair subcommand:
SplitSeq.py samplepair -s reads.fastq -f SAMPLE -u M1 -n 1000 --coord illumina
Cleaning or removing poor quality sequences¶
Data sets can be cleaned using one or more invocations of FilterSeq, which provides multiple sequence quality control operations. Four subcommands remove sequences from the data that fail to meet some threshold: including length, (length), number of N or gap characters (missing), homopolymeric tract length (repeats), or mean Phred quality score (quality). Two subcommands modify sequences without removing them: trimqual truncates the sequences when the mean Phred quality scores decays under a threshold, and maskqual replaces positions with low Phred quality scores with N characters.
FilterSeq provides the following quality control subcommands:
Subcommand | Operation |
---|---|
length | Removes short sequences |
missing | Removes sequences with too many Ns or gaps |
repeats | Removes sequences with long homopolymeric tracts |
quality | Removes sequences with low mean quality scores |
trimqual | Truncates sequences where quality scores decay |
maskqual | Masks low quality positions |
Assembling paired-end reads that do not overlap¶
The typical way to assemble paired-end reads is via de novo assembly using the align subcommand of AssemblePairs. However, some sequences with long CDR3 regions may fail to assemble due to insufficient, or completely absent, overlap between the mate-pairs. The reference or sequential subcommands can be used to assemble mate-pairs that do not overlap using the ungapped V-segment references sequences as a guide.
To handle such sequence in two separate steps, a normal align command
would be performed first. The --failed
argument is added so that the reads failing de novo alignment are output to
separate files:
AssemblePairs.py align -1 reads-1.fastq -2 reads-2.fastq --rc tail \
--coord illumina --failed -outname align
Then, the files labeled assemble-fail
, along with the ungapped V-segment
reference sequences (-r vref.fasta
),
would be input into the reference subcommand of AssemblePairs:
AssemblePairs.py reference -1 align-1_assemble-fail.fastq -2 align-2_assemble-fail.fastq \
--rc tail -r vref.fasta --coord illumina --outname ref
This will result in two separate assemble-pass
files - one from each step. You may
process them separately or concatenate them together into a single file:
cat align_assemble-pass.fastq ref_assemble-pass.fastq > merged_assemble-pass.fastq
However, if you intend to processes them together, you may simplify this by perform both steps using the sequential subcommand, which will attempt de novo assembly followed by reference guided assembly if de novo assembly fails:
AssemblePairs.py sequential -1 reads-1.fastq -2 reads-2.fastq --rc tail \
--coord illumina -r vref.fasta
Note
The sequences output by the reference or sequential subcommands may contain an appropriate length spacer of Ns between any mate-pairs that do not overlap. The :option:–fill <AssemblePairs reference –fill>`` argument may be specified to force AssemblePairs to insert the germline sequence into the missing positions, but this should be used with caution as the inserted sequence may not be biologically correct.
Assigning isotype annotations from the constant region sequence¶
MaskPrimers is usually used to remove primer regions and annotate sequences with primer identifiers. However, it can be used for any other case where you need to align a set of short sequences against the reads. One example alternate use is where you either do not know the C-region primer sequences or do not trust the primer region to provide an accurate isotype assignment.
If you build a FASTA file containing the reverse-complement of short sequences from the front of CH-1, then you can annotate the reads with these sequence in the same way you would C-region specific primers:
MaskPrimers.py align -s reads.fastq -p IGHC.fasta --maxlen 100 --maxerror 0.3 \
--mode cut --revpr --pf C_CALL
Where --revpr
tells MaskPrimers to
reverse-complement the “primer” sequences and look for them at the end of the reads,
--maxlen 100
restricts the search to the last
100 bp, --maxerror 0.3
allows for up to
30% mismatches, and -p IGHC.fasta
specifies the file
containing the CH-1 sequences. The name of the C-region will be added to the sequence
headers as the C_CALL
annotation, where the field name is specified by the
--pf
argument. An example CH-1 sequence file would look like:
>IGHD
CTGATATGATGGGGAACACATCCGGAGCCTTGGTGGGTGC
>IGHM
AGGAGACGAGGGGGAAAAGGGTTGGGGCGGATGCACTCCC
>IGHG
AGGGYGCCAGGGGGAAGACSGATGGGCCCTTGGTGGAAGC
>IGHA
MGAGGCTCAGCGGGAAGACCTTGGGGCTGGTCGGGGATGC
>IGHE
AGCGGGTCAAGGGGAAGACGGATGGGCTCTGTGTGGAGGC
See also
Constant region reference sequences may be downloaded from IMGT and the sequence headers can be reformated using the imgt subcommand of ConvertHeaders. Note, you may need to clean-up the reference sequences a bit before running ConvertHeaders if you receive an error about duplicate sequence names (eg, remove duplicate allele with different artificial splicing). To cut and reverse-complement the constant region sequences use something like seqmagick.
Estimating sequencing and PCR error rates with UMI data¶
The EstimateError tool provides methods for estimating the combined
PCR and sequencing error rates from large UMI read groups. The assumptions being,
that consensus sequences generated from sufficiently large UMI read groups should
be accurate representations of the true sequences, and that the rate of mismatches
from consensus should therefore be an accurate estimate of the error rate in
the data. However, this is not guaranteed to be true, hence this approach can only
be considered an estimate of a data set’s error profile. The following command
generates an error profile from UMI read groups with 50 or more sequences
(-n 50
), using a majority rule consensus sequence
(--mode freq
), and excluding UMI read groups
with high nucleotide diversity (--maxdiv 0.1
):
EstimateError.py -s reads.fastq -n 50 --mode freq --maxdiv 0.1
This generates the following tab-delimited files containing error rates broken down by various criteria:
File | Error profile |
---|---|
reads_error-position.tab | Error rates by read position |
reads_error-quality.tab | Error rates by quality score |
reads_error-nucleotide.tab | Error rates by nucleotide identity |
reads_error-set.tab | Error rates by UMI read group size |