Running phold
Regardless of your input, phold is split into two main modules: predict and compare.
This split allows efficient use of GPU and CPU resources in cluster environments: predict benefits greatly from a GPU, whereas compare does not use one by default.
If you have a local workstation with a GPU, please use phold run, which combines both in one.
If you have a local workstation without a GPU, please use phold run with --cpu.
If you have trouble running ProstT5, please use phold remote.
If you are having trouble running phold offline after installation (e.g. on an HPC), you may need to add TRANSFORMERS_OFFLINE=True to your environment.
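The workstation choices above can be wrapped in a short script. This is a minimal Python sketch and not part of phold: the helper names `phold_run_command` and `offline_env` are hypothetical, and only flags listed in this document are used. It builds a `phold run` command with or without `--cpu`, and sets TRANSFORMERS_OFFLINE=True in a copy of the environment.

```python
import os
import shlex


def phold_run_command(input_path, output_dir, has_gpu, threads=1):
    """Build a `phold run` command; --cpu is appended when no GPU is available."""
    cmd = ["phold", "run", "-i", input_path, "-o", output_dir, "-t", str(threads)]
    if not has_gpu:
        cmd.append("--cpu")
    return cmd


def offline_env(base_env=None):
    """Return a copy of the environment with TRANSFORMERS_OFFLINE=True set."""
    env = dict(os.environ if base_env is None else base_env)
    env["TRANSFORMERS_OFFLINE"] = "True"
    return env


cmd = phold_run_command("pharokka.gbk", "phold_output", has_gpu=False, threads=8)
env = offline_env()
print(shlex.join(cmd))
# To execute on a machine where phold is installed:
# import subprocess; subprocess.run(cmd, env=env, check=True)
```

The command is only printed here, so the sketch can be checked on a machine without phold before it is used in a job script.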
Input
Most phold subcommands take as input a GenBank format file containing the output of pharokka for your phage or phage contigs. This file is called pharokka.gbk by default in your pharokka output. phold will also accept bakta GenBank or NCBI RefSeq GenBank format input.
Alternatively, phold will detect whether the input is a FASTA contig/genome file. If so, Pyrodigal-gv will be run to quickly predict the CDS, and these will then be annotated. However, unlike in pharokka, tRNAs, tmRNAs and CRISPR repeats will not be predicted for now.
For phold proteins-predict and phold proteins-compare, the input will be a FASTA format file containing amino acid protein sequences. These commands are useful for annotating bulk phage proteins.
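As a rough illustration of how a GenBank input can be told apart from a FASTA input, here is a minimal Python sketch. It is not phold's actual detection code: the function name `sniff_input_format` is hypothetical, and it relies only on two conventions, that FASTA records begin with `>` and GenBank records begin with `LOCUS`.

```python
def sniff_input_format(text):
    """Guess the input format from the first non-empty line of the file."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines
        if stripped.startswith(">"):
            return "fasta"
        if stripped.startswith("LOCUS"):
            return "genbank"
        return "unknown"
    return "unknown"


print(sniff_input_format(">contig_1\nACGTACGT\n"))
print(sniff_input_format("LOCUS       NC_043029   43000 bp    DNA\n"))
```

This check cannot distinguish a nucleotide FASTA from a protein FASTA, which is one reason the proteins-predict and proteins-compare subcommands exist as separate entry points.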
Subcommands
phold predict
predict uses the ProstT5 protein language model to translate protein amino acid sequences to the 3Di token alphabet used by foldseek. This module is greatly accelerated by a GPU, so using one is recommended if available.
Usage: phold predict [OPTIONS]
Uses ProstT5 to predict 3Di tokens - GPU recommended
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
-i, --input PATH Path to input file in Genbank format or
nucleotide FASTA format [required]
-o, --output PATH Output directory [default: output_phold]
-t, --threads INTEGER Number of threads [default: 1]
-p, --prefix TEXT Prefix for output files [default: phold]
-d, --database TEXT Specific path to installed phold database
-f, --force Force overwrites the output directory
--autotune Run autotuning to detect and automatically
use best batch size for your hardware.
Recommended only if you have a large dataset
(e.g. thousands of proteins), or else
autotuning will add rather than save runtime.
--batch_size INTEGER batch size for ProstT5. [default: 1]
--cpu Use cpus only.
--omit_probs Do not output per residue 3Di probabilities
from ProstT5. Mean per protein 3Di
probabilities will always be output.
--save_per_residue_embeddings Save the ProstT5 embeddings per residue in a
h5 file
--save_per_protein_embeddings Save the ProstT5 embeddings as means per
protein in a h5 file
--mask_threshold FLOAT Masks 3Di residues below this value of
ProstT5 confidence for Foldseek searches
[default: 25]
--finetune Use gbouras13/ProstT5Phold encoder + CNN
model both finetuned on phage proteins
--vanilla Use vanilla CNN model (trained on CASP14)
with ProstT5Phold encoder instead of the one
trained on phage proteins
--hyps Use this to only annotate hypothetical
proteins from a Pharokka GenBank input
Example usage (assuming you have run phold install)
phold predict -i pharokka.gbk -o phold_predict_output
phold compare
phold compare runs foldseek to compare ProstT5 predictions generated by phold predict to the phold database.
Alternatively, if you have pre-generated .pdb or .cif format protein structures for your proteins, you can supply them by specifying --structures --structure_dir <directory>.
phold compare does not use a GPU by default. However, if you have one available, you can enable Foldseek-GPU acceleration with --foldseek_gpu. Note that you must also have run phold install with --foldseek_gpu beforehand. Whether or not you use --foldseek_gpu, it is recommended to use as many CPU threads as you can with -t, because the GPU only accelerates Foldseek's prefilter, not the alignment step.
Example usage of phold compare following phold predict
phold compare -i pharokka.gbk -o phold_compare_output --predictions_dir phold_predict_output
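On a cluster, the two steps are typically submitted as separate jobs (predict on a GPU node, compare on a CPU node). The following Python sketch shows one way to keep the two commands consistent, so that the output directory of predict is the directory passed to --predictions_dir. The helper name `cluster_commands` is hypothetical, and job submission itself is left out because it depends on your scheduler.

```python
import shlex


def cluster_commands(gbk, predict_dir, compare_dir, threads=8):
    """Return the predict and compare commands as argument lists."""
    predict = ["phold", "predict", "-i", gbk, "-o", predict_dir]
    compare = [
        "phold", "compare", "-i", gbk, "-o", compare_dir,
        "--predictions_dir", predict_dir,  # must match predict's output directory
        "-t", str(threads),
    ]
    return predict, compare


predict_cmd, compare_cmd = cluster_commands(
    "pharokka.gbk", "phold_predict_output", "phold_compare_output"
)
for step in (predict_cmd, compare_cmd):
    print(shlex.join(step))
```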
Example usage if you have .pdb or .cif format structures available for your phage proteins (note: each structure file must be named cds_id.pdb, where cds_id is the CDS ID output by Pharokka). You can see an example in the tests/test_data/NC_043029_pdbs directory.
phold compare -i pharokka.gbk -o phold_compare_output_pdb --structures --structure_dir directory_with_pdbs -t 8
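Because the structure files must be named after the CDS IDs, it can help to check the directory before running phold compare. The sketch below is a hypothetical helper, not part of phold: `missing_structures` and the example IDs are illustrative, and it assumes the file stem must equal the CDS ID exactly. It lists the CDS IDs that have no matching .pdb or .cif file.

```python
import tempfile
from pathlib import Path


def missing_structures(cds_ids, structure_dir, suffixes=(".pdb", ".cif")):
    """Return the CDS IDs with no file named <cds_id>.pdb or <cds_id>.cif."""
    stems = {
        path.stem
        for path in Path(structure_dir).iterdir()
        if path.suffix.lower() in suffixes
    }
    return [cds_id for cds_id in cds_ids if cds_id not in stems]


# Demonstration with a temporary directory and made-up CDS IDs.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "CDS_0001.pdb").touch()
    (Path(tmp) / "CDS_0002.cif").touch()
    demo_missing = missing_structures(["CDS_0001", "CDS_0002", "CDS_0003"], tmp)

print(demo_missing)
```

In practice the list of CDS IDs would come from the Pharokka GenBank file rather than being typed by hand.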
If you have a custom database of protein structures that you would additionally like to search against, phold supports this using --custom_db, with the path to the database.
You will first need to use foldseek createdb to create a Foldseek compatible database. For example, assuming you have protein structures in a directory called custom_structures:
foldseek createdb custom_structures/ mycustomdb
phold compare -i pharokka.gbk -o phold_compare_output_with_custom_db --custom_db mycustomdb -t 8
Usage: phold compare [OPTIONS]
Runs Foldseek vs phold db
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
-i, --input PATH Path to input file in Genbank format or
nucleotide FASTA format [required]
--predictions_dir PATH Path to output directory from phold predict
--structures Use if you have .pdb or .cif file structures
for the input proteins (e.g. with
AF2/Colabfold .pdb or AF3 for .cif) in a
directory that you specify with
--structure_dir
--structure_dir PATH Path to directory with .pdb or .cif file
structures. The CDS IDs need to be in the name
of the file
--filter_structures Flag that creates a copy of the .pdb or .cif
files structures with matching record IDs
found in the input GenBank file. Helpful if
you have a directory with lots of .pdb files
and want to annotate only e.g. 1 phage.
-o, --output PATH Output directory [default: output_phold]
-t, --threads INTEGER Number of threads [default: 1]
-p, --prefix TEXT Prefix for output files [default: phold]
-d, --database TEXT Specific path to installed phold database
-f, --force Force overwrites the output directory
-e, --evalue FLOAT Evalue threshold for Foldseek [default: 1e-3]
-s, --sensitivity FLOAT Sensitivity parameter for foldseek [default:
9.5]
--keep_tmp_files Keep temporary intermediate files,
particularly the large foldseek_results.tsv of
all Foldseek hits
--card_vfdb_evalue FLOAT Stricter E-value threshold for Foldseek CARD
and VFDB hits [default: 1e-10]
--separate Output separate GenBank files for each contig
--max_seqs INTEGER Maximum results per query sequence allowed to
pass the prefilter. You may want to reduce
this to save disk space for enormous datasets
[default: 1000]
--ultra_sensitive Runs phold with maximum sensitivity by
skipping Foldseek prefilter. Not recommended
for large datasets.
--extra_foldseek_params TEXT Extra foldseek search params
--custom_db TEXT Path to custom database
--foldseek_gpu Use this to enable compatibility with
Foldseek-GPU search acceleration
--restart Use this to restart phold from 'Processing
Foldseek output' after foldseek_results.tsv is
generated
phold run
phold run runs phold predict and phold compare together in one command. Recommended if you are running phold on a local workstation with an available GPU. If you have an NVIDIA GPU, --foldseek_gpu is recommended to accelerate Foldseek.
phold run is also recommended if you are running phold in a low-resource environment without a GPU (e.g. on a laptop); in that case you will also need to specify --cpu.
Example usage where NVIDIA GPU is available:
phold run -i pharokka.gbk -o phold_output -t 8 --foldseek_gpu
Usage: phold run [OPTIONS]
phold predict then compare all in one - GPU recommended
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
-i, --input PATH Path to input file in Genbank format or
nucleotide FASTA format [required]
-o, --output PATH Output directory [default: output_phold]
-t, --threads INTEGER Number of threads [default: 1]
-p, --prefix TEXT Prefix for output files [default: phold]
-d, --database TEXT Specific path to installed phold database
-f, --force Force overwrites the output directory
--autotune Run autotuning to detect and automatically
use best batch size for your hardware.
Recommended only if you have a large dataset
(e.g. thousands of proteins), or else
autotuning will add rather than save runtime.
--batch_size INTEGER batch size for ProstT5. [default: 1]
--cpu Use cpus only.
--omit_probs Do not output per residue 3Di probabilities
from ProstT5. Mean per protein 3Di
probabilities will always be output.
--save_per_residue_embeddings Save the ProstT5 embeddings per residue in a
h5 file
--save_per_protein_embeddings Save the ProstT5 embeddings as means per
protein in a h5 file
--mask_threshold FLOAT Masks 3Di residues below this value of
ProstT5 confidence for Foldseek searches
[default: 25]
--finetune Use gbouras13/ProstT5Phold encoder + CNN
model both finetuned on phage proteins
--vanilla Use vanilla CNN model (trained on CASP14)
with ProstT5Phold encoder instead of the one
trained on phage proteins
--hyps Use this to only annotate hypothetical
proteins from a Pharokka GenBank input
-e, --evalue FLOAT Evalue threshold for Foldseek [default:
1e-3]
-s, --sensitivity FLOAT Sensitivity parameter for foldseek [default:
9.5]
--keep_tmp_files Keep temporary intermediate files,
particularly the large foldseek_results.tsv
of all Foldseek hits
--card_vfdb_evalue FLOAT Stricter E-value threshold for Foldseek CARD
and VFDB hits [default: 1e-10]
--separate Output separate GenBank files for each contig
--max_seqs INTEGER Maximum results per query sequence allowed to
pass the prefilter. You may want to reduce
this to save disk space for enormous datasets
[default: 1000]
--ultra_sensitive Runs phold with maximum sensitivity by
skipping Foldseek prefilter. Not recommended
for large datasets.
--extra_foldseek_params TEXT Extra foldseek search params
--custom_db TEXT Path to custom database
--foldseek_gpu Use this to enable compatibility with
Foldseek-GPU search acceleration
--restart Use this to restart phold from 'Processing
Foldseek output' after foldseek_results.tsv
is generated
phold proteins-predict
Identical to phold predict, but instead takes a FASTA input file of amino acid protein sequences. Useful for bulk annotation of phage proteins.
Example usage
phold proteins-predict -i phage_proteins.faa -o phold_proteins_predict_output
Usage: phold proteins-predict [OPTIONS]
Runs ProstT5 on a multiFASTA input - GPU recommended
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
-i, --input PATH Path to input multiFASTA file [required]
-o, --output PATH Output directory [default: output_phold]
-t, --threads INTEGER Number of threads [default: 1]
-p, --prefix TEXT Prefix for output files [default: phold]
-d, --database TEXT Specific path to installed phold database
-f, --force Force overwrites the output directory
--autotune Run autotuning to detect and automatically
use best batch size for your hardware.
Recommended only if you have a large dataset
(e.g. thousands of proteins), or else
autotuning will add rather than save runtime.
--batch_size INTEGER batch size for ProstT5. [default: 1]
--cpu Use cpus only.
--omit_probs Do not output per residue 3Di probabilities
from ProstT5. Mean per protein 3Di
probabilities will always be output.
--save_per_residue_embeddings Save the ProstT5 embeddings per residue in a
h5 file
--save_per_protein_embeddings Save the ProstT5 embeddings as means per
protein in a h5 file
--mask_threshold FLOAT Masks 3Di residues below this value of
ProstT5 confidence for Foldseek searches
[default: 25]
--finetune Use gbouras13/ProstT5Phold encoder + CNN
model both finetuned on phage proteins
--vanilla Use vanilla CNN model (trained on CASP14)
with ProstT5Phold encoder instead of the one
trained on phage proteins
--hyps Use this to only annotate hypothetical
proteins from a Pharokka GenBank input
phold proteins-compare
Identical to phold compare, but instead takes a FASTA input file of amino acid protein sequences. Useful for bulk annotation of phage proteins.
Example usage
phold proteins-compare -i phage_proteins.faa --predictions_dir phold_proteins_predict_output -o phold_proteins_compare_output -t 8
Usage: phold proteins-compare [OPTIONS]
Runs Foldseek vs phold db on proteins input
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
-i, --input PATH Path to input file in multiFASTA format
[required]
--predictions_dir PATH Path to output directory from phold proteins-
predict
--structures Use if you have .pdb or .cif file structures
for the input proteins (e.g. with
AF2/Colabfold) in a directory that you specify
with --structure_dir
--structure_dir PATH Path to directory with .pdb or .cif file
structures. The CDS IDs need to be in the name
of the file
--filter_structures Flag that creates a copy of the .pdb or .cif
files structures with matching record IDs
found in the input GenBank file. Helpful if
you have a directory with lots of .pdb files
and want to annotate only e.g. 1 phage.
-o, --output PATH Output directory [default: output_phold]
-t, --threads INTEGER Number of threads [default: 1]
-p, --prefix TEXT Prefix for output files [default: phold]
-d, --database TEXT Specific path to installed phold database
-f, --force Force overwrites the output directory
-e, --evalue FLOAT Evalue threshold for Foldseek [default: 1e-3]
-s, --sensitivity FLOAT Sensitivity parameter for foldseek [default:
9.5]
--keep_tmp_files Keep temporary intermediate files,
particularly the large foldseek_results.tsv of
all Foldseek hits
--card_vfdb_evalue FLOAT Stricter E-value threshold for Foldseek CARD
and VFDB hits [default: 1e-10]
--separate Output separate GenBank files for each contig
--max_seqs INTEGER Maximum results per query sequence allowed to
pass the prefilter. You may want to reduce
this to save disk space for enormous datasets
[default: 1000]
--ultra_sensitive Runs phold with maximum sensitivity by
skipping Foldseek prefilter. Not recommended
for large datasets.
--extra_foldseek_params TEXT Extra foldseek search params
--custom_db TEXT Path to custom database
--foldseek_gpu Use this to enable compatibility with
Foldseek-GPU search acceleration
--restart Use this to restart phold from 'Processing
Foldseek output' after foldseek_results.tsv is
generated
phold remote
This command queries the Foldseek webserver to predict the 3Di sequence instead of running ProstT5 locally, then runs a local Foldseek search against the Phold database. This is recommended for users with extremely low compute (such as an old laptop) or who can't get ProstT5 to run on their machine.
I would recommend our Colab notebook instead to be honest.
Example usage
phold remote -i pharokka.gbk -o phold_remote_output -t 8
Usage: phold remote [OPTIONS]
Uses Foldseek API to run ProstT5 then Foldseek locally
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
-i, --input PATH Path to input file in Genbank format or
nucleotide FASTA format [required]
-o, --output PATH Output directory [default: output_phold]
-t, --threads INTEGER Number of threads [default: 1]
-p, --prefix TEXT Prefix for output files [default: phold]
-d, --database TEXT Specific path to installed phold database
-f, --force Force overwrites the output directory
-e, --evalue FLOAT Evalue threshold for Foldseek [default: 1e-3]
-s, --sensitivity FLOAT Sensitivity parameter for foldseek [default:
9.5]
--keep_tmp_files Keep temporary intermediate files,
particularly the large foldseek_results.tsv of
all Foldseek hits
--card_vfdb_evalue FLOAT Stricter E-value threshold for Foldseek CARD
and VFDB hits [default: 1e-10]
--separate Output separate GenBank files for each contig
--max_seqs INTEGER Maximum results per query sequence allowed to
pass the prefilter. You may want to reduce
this to save disk space for enormous datasets
[default: 1000]
--ultra_sensitive Runs phold with maximum sensitivity by
skipping Foldseek prefilter. Not recommended
for large datasets.
--extra_foldseek_params TEXT Extra foldseek search params
--custom_db TEXT Path to custom database
--foldseek_gpu Use this to enable compatibility with
Foldseek-GPU search acceleration
phold createdb
This is an auxiliary command that allows you to create a Foldseek compatible database from AA and 3Di protein sequences (such as those created by phold predict).
Example usage
phold createdb --fasta_aa phold_aa.fasta --fasta_3di phold_3di.fasta -o my_foldseek_db
Usage: phold createdb [OPTIONS]
Creates foldseek DB from AA FASTA and 3Di FASTA input files
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
--fasta_aa PATH Path to input Amino Acid FASTA file of proteins
[required]
--fasta_3di PATH Path to input 3Di FASTA file of proteins [required]
-o, --output PATH Output directory [default:
output_phold_foldseek_db]
-t, --threads INTEGER Number of threads to use with Foldseek [default: 1]
-p, --prefix TEXT Prefix for Foldseek database [default:
phold_foldseek_db]
-f, --force Force overwrites the output directory
phold plot
This is an auxiliary command that allows you to create Circos plots for your phage(s) with pyCirclize. It requires only the phold GenBank file as its input and will output .png and .svg format files.
If you have annotated more than 1 contig with phold, phold plot will automatically plot them all in separate files. The contig ids will be the prefix of the output plot file names.
Example usage
phold plot -i phold.gbk -o phold_plots
Usage: phold plot [OPTIONS]
Creates Phold Circular Genome Plots
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
-i, --input PATH Path to input file in Genbank format (in the
phold output directory) [required]
-o, --output PATH Output directory to store phold plots
[default: phold_plots]
-p, --prefix TEXT Prefix for output files. Needs to match what
phold was run with. [default: phold]
-f, --force Force overwrites the output directory
-a, --all Plot every contig.
-t, --plot_title TEXT Plot title. Only applies if --all is not
specified. Will default to the phage's
contig id.
--label_hypotheticals Flag to label hypothetical or unknown
proteins. By default these are not labelled
--remove_other_features_labels Flag to remove labels for
tRNA/tmRNA/CRISPRs. By default these are
labelled. They will still be plotted in
black
--title_size FLOAT Controls title size. Must be an integer.
Defaults to 20
--label_size INTEGER Controls annotation label size. Must be an
integer. Defaults to 8
--interval INTEGER Axis tick interval. Must be an integer.
Defaults to 5000.
--truncate INTEGER Number of characters to include in annotation
labels before truncation with ellipsis.
Must be an integer. Defaults to 20.
--dpi INTEGER Resolution (dots per inch). Must be an
integer. Defaults to 600.
--annotations FLOAT Controls the proportion of annotations
labelled. Must be a proportion between 0 and
1 inclusive. 0 = no annotations, 0.5 = half
of the annotations, 1 = all annotations.
Defaults to 1. Chosen in order of CDS size.
--label_ids TEXT Text file with list of CDS IDs (from gff
file) that are guaranteed to be labelled.
phold autotune
This is an auxiliary command that allows you to detect the most efficient batch size for your hardware.
It works by running ProstT5 on a variety of batch sizes, and it will print the best batch size to screen.
The same functionality is available with phold run, phold predict and phold proteins-predict using --autotune (though with fewer parameters available to tweak than with phold autotune). If you use --autotune with these subcommands, the best batch size will then be used automatically.
By default, phold autotune does not take an input file: it uses a sample of up to 5000 proteins from the Phold DB 1.36M.
You can still specify a custom proteins .faa input to sample from using -i (in case your proteins are very different from those in the Phold DB).
Example usage
phold autotune -d phold_db --min_batch 1 --step 20 --max_batch 1001 --sample_seqs 1000
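To give a feel for what a batch size sweep like this involves, here is a minimal Python sketch. It is not phold's implementation: the function names are hypothetical, the inclusive upper bound is an assumption about how --min_batch, --step and --max_batch combine, and the timings are invented purely for illustration. It enumerates candidate batch sizes and picks the one with the lowest runtime on the same protein sample.

```python
def candidate_batch_sizes(min_batch=1, step=10, max_batch=251):
    """Batch sizes from min_batch up to max_batch (inclusive, an assumption)."""
    return list(range(min_batch, max_batch + 1, step))


def best_batch_size(seconds_by_batch):
    """Pick the batch size with the lowest measured runtime."""
    return min(seconds_by_batch, key=seconds_by_batch.get)


sizes = candidate_batch_sizes(min_batch=1, step=20, max_batch=1001)
print(len(sizes), sizes[:3], sizes[-1])

# Hypothetical timings in seconds, for illustration only (not measured).
timings = {1: 120.0, 21: 45.0, 41: 38.5, 61: 40.2}
print(best_batch_size(timings))
```

Under these assumptions, the example command above would test 51 batch sizes, which is why a larger --step keeps the autotuning itself from dominating the runtime.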
Usage: phold autotune [OPTIONS]
Determines optimal batch size for 3Di prediction with your hardware
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
-i, --input PATH Optional path to input file of proteins if you do not
want to use the default sample of 5000 Phold DB
proteins
--cpu Use cpus only.
-t, --threads INTEGER Number of threads [default: 1]
-d, --database TEXT Specific path to installed phold database
--min_batch INTEGER Minimum batch size to test [default: 1]
--step INTEGER Controls batch size step increment [default: 10]
--max_batch INTEGER Maximum batch size to test [default: 251]
--sample_seqs INTEGER Number of proteins to subsample from input.
[default: 500]