Running phold

Regardless of your input, phold is split into two main modules: predict and compare.

phold is split this way so that GPU and CPU resources can each be used efficiently in cluster environments.

If you have a local workstation with a GPU, please use phold run, which combines both modules in one command.

If you have a local workstation without a GPU, please use phold run with --cpu.

If you have trouble running ProstT5, please use phold remote.
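The choice between the GPU and CPU variants of phold run can be sketched as a small shell helper. This is only an illustration: the function name phold_run_cmd is hypothetical, and treating the presence of nvidia-smi on PATH as a sign of a usable NVIDIA GPU is an assumption, not something phold itself checks this way. The helper only prints the command so that you can inspect it before running it.

```shell
# Hypothetical helper: print a phold run command suited to this machine.
# Assumption: nvidia-smi on PATH is taken as a rough sign of a usable NVIDIA GPU.
phold_run_cmd() {
    gbk=$1       # input GenBank file, e.g. pharokka.gbk
    outdir=$2    # output directory
    threads=$3   # number of CPU threads
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo "phold run -i $gbk -o $outdir -t $threads"
    else
        # No GPU detected: fall back to CPU-only mode.
        echo "phold run -i $gbk -o $outdir -t $threads --cpu"
    fi
}

phold_run_cmd pharokka.gbk phold_output 8
```

If ProstT5 does not run at all on the machine, neither printed command will help, and phold remote is the documented fallback.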

If you are having trouble running phold offline after installation (e.g. on an HPC), you may need to add TRANSFORMERS_OFFLINE=True to your environment.
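A minimal sketch of setting this variable for the current shell session, using the value given above. The phold run line is only an illustration and is left commented out.

```shell
# Ask the Hugging Face transformers library not to contact the network,
# so that ProstT5 is loaded from files already present on disk.
export TRANSFORMERS_OFFLINE=True

# Then run phold as usual in the same session, for example:
# phold run -i pharokka.gbk -o phold_output -t 8
```

To make the setting permanent, the export line could be added to your shell profile or to the job script you submit on the cluster.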

Input

Most phold subcommands take as input a GenBank-formatted file containing the pharokka output for your phage or phage contigs. By default, this file is called pharokka.gbk in your pharokka output directory. phold also accepts bakta GenBank and NCBI RefSeq GenBank input.

Alternatively, phold will detect if the input is a FASTA contig/genome file. If so, Pyrodigal-gv is run to quickly predict the CDS, which are then annotated. However, tRNAs, tmRNAs and CRISPR repeats are not predicted for now (unlike in pharokka).

For phold proteins-predict and phold proteins-compare, the input will be a FASTA format file containing amino acid protein sequences. These commands are useful for annotating bulk phage proteins.

Subcommands

phold predict

predict uses the ProstT5 protein language model to translate protein amino acid sequences to the 3Di token alphabet used by foldseek. This module is greatly accelerated by a GPU, so using one is recommended if available.

Usage: phold predict [OPTIONS]

  Uses ProstT5 to predict 3Di tokens - GPU recommended

Options:
  -h, --help                     Show this message and exit.
  -V, --version                  Show the version and exit.
  -i, --input PATH               Path to input file in Genbank format or
                                 nucleotide FASTA format  [required]
  -o, --output PATH              Output directory   [default: output_phold]
  -t, --threads INTEGER          Number of threads  [default: 1]
  -p, --prefix TEXT              Prefix for output files  [default: phold]
  -d, --database TEXT            Specific path to installed phold database
  -f, --force                    Force overwrites the output directory
  --autotune                     Run autotuning to detect and automatically
                                 use best batch size for your hardware.
                                 Recommended only if you have a large dataset
                                 (e.g. thousands of proteins), or else
                                 autotuning will add rather than save runtime.
  --batch_size INTEGER           batch size for ProstT5.  [default: 1]
  --cpu                          Use cpus only.
  --omit_probs                   Do not output per residue 3Di probabilities
                                 from ProstT5. Mean per protein 3Di
                                 probabilities will always be output.
  --save_per_residue_embeddings  Save the ProstT5 embeddings per resuide in a
                                 h5 file
  --save_per_protein_embeddings  Save the ProstT5 embeddings as means per
                                 protein in a h5 file
  --mask_threshold FLOAT         Masks 3Di residues below this value of
                                 ProstT5 confidence for Foldseek searches
                                 [default: 25]
  --finetune                     Use gbouras13/ProstT5Phold encoder + CNN
                                 model both finetuned on phage proteins
  --vanilla                      Use vanilla CNN model (trained on CASP14)
                                 with ProstT5Phold encoder instead of the one
                                 trained on phage proteins
  --hyps                         Use this to only annotate hypothetical
                                 proteins from a Pharokka GenBank input

Example usage (assuming you have run phold install)

phold predict -i pharokka.gbk -o phold_predict_output 

phold compare

phold compare runs foldseek to compare ProstT5 predictions generated by phold predict to the phold database.

Alternatively, if you have pre-generated .pdb format protein structures for your proteins, you can supply them with --structures --structure_dir <directory>.

phold compare does not use a GPU by default. However, if you have one available, you can use Foldseek-GPU acceleration with --foldseek_gpu. Note that you must also have run phold install with --foldseek_gpu beforehand. Whether or not you use --foldseek_gpu, it is recommended to use as many CPU threads as you can with -t (the GPU only accelerates Foldseek's prefilter, not the alignment step).

Example usage of phold compare following phold predict

phold compare -i pharokka.gbk -o phold_compare_output  --predictions_dir phold_predict_output
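On a cluster, the two stages might be chained as in the sketch below. The run helper, the DRY_RUN variable and the function name phold_two_stage are hypothetical conveniences and not part of phold; only the phold flags are taken from the usage text. The point it illustrates is that the directory written by phold predict is the one passed to --predictions_dir.

```shell
# Hypothetical helper: print each command, and execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# Stage 1 (GPU node): ProstT5 prediction.
# Stage 2 (CPU node): Foldseek search against the phold database.
phold_two_stage() {
    gbk=$1        # input GenBank file, e.g. pharokka.gbk
    threads=$2    # CPU threads for the compare stage
    run phold predict -i "$gbk" -o phold_predict_output &&
    run phold compare -i "$gbk" -o phold_compare_output \
        --predictions_dir phold_predict_output -t "$threads"
}
```

Calling phold_two_stage pharokka.gbk 8 with DRY_RUN=1 set prints both commands without executing them. In practice each stage would usually go into its own job script so that only the predict job requests a GPU.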

Example usage if you have .pdb or .cif format structures available for your phage proteins (note: the .pdb files must be named cds_id.pdb, where cds_id is the CDS ID output by Pharokka). See the tests/test_data/NC_043029_pdbs directory for an example.

phold compare -i pharokka.gbk -o phold_compare_output_pdb --structures --structure_dir directory_with_pdbs -t 8
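Because the structure files are matched to CDS by file name, it can help to check the directory before running phold compare. The sketch below is a hypothetical helper, not part of phold: it assumes you have already written the CDS IDs of interest to a text file, one per line, and it reports every ID that has neither a .pdb nor a .cif file.

```shell
# Hypothetical helper: list CDS IDs that have no <cds_id>.pdb or <cds_id>.cif
# file in the structure directory. ids_file holds one CDS ID per line.
missing_structures() {
    ids_file=$1
    structure_dir=$2
    while IFS= read -r cds_id; do
        # Skip empty lines in the ID list.
        [ -n "$cds_id" ] || continue
        if [ ! -f "$structure_dir/$cds_id.pdb" ] && [ ! -f "$structure_dir/$cds_id.cif" ]; then
            echo "$cds_id"
        fi
    done < "$ids_file"
}
```

For example, missing_structures cds_ids.txt directory_with_pdbs prints nothing when every listed CDS has a structure file.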

If you have a custom database of protein structures that you would additionally like to search against, phold supports this using --custom_db. You will first need to use foldseek createdb to create a Foldseek-compatible database. For example, assuming you have protein structures in a directory called custom_structures:

foldseek createdb custom_structures/ mycustomdb
phold compare -i pharokka.gbk -o phold_compare_output_with_custom_db --custom_db mycustomdb -t 8

Usage: phold compare [OPTIONS]

  Runs Foldseek vs phold db

Options:
  -h, --help                    Show this message and exit.
  -V, --version                 Show the version and exit.
  -i, --input PATH              Path to input file in Genbank format or
                                nucleotide FASTA format  [required]
  --predictions_dir PATH        Path to output directory from phold predict
  --structures                  Use if you have .pdb or .cif file structures
                                for the input proteins (e.g. with
                                AF2/Colabfold .pdb or AF3 for .cif) in a
                                directory that you specify with
                                --structure_dir
  --structure_dir PATH          Path to directory with .pdb or .cif file
                                structures. The CDS IDs need to be in the name
                                of the file
  --filter_structures           Flag that creates a copy of the .pdb or .cif
                                files structures with matching record IDs
                                found in the input GenBank file. Helpful if
                                you have a directory with lots of .pdb files
                                and want to annotate only e.g. 1 phage.
  -o, --output PATH             Output directory   [default: output_phold]
  -t, --threads INTEGER         Number of threads  [default: 1]
  -p, --prefix TEXT             Prefix for output files  [default: phold]
  -d, --database TEXT           Specific path to installed phold database
  -f, --force                   Force overwrites the output directory
  -e, --evalue FLOAT            Evalue threshold for Foldseek  [default: 1e-3]
  -s, --sensitivity FLOAT       Sensitivity parameter for foldseek  [default:
                                9.5]
  --keep_tmp_files              Keep temporary intermediate files,
                                particularly the large foldseek_results.tsv of
                                all Foldseek hits
  --card_vfdb_evalue FLOAT      Stricter E-value threshold for Foldseek CARD
                                and VFDB hits  [default: 1e-10]
  --separate                    Output separate GenBank files for each contig
  --max_seqs INTEGER            Maximum results per query sequence allowed to
                                pass the prefilter. You may want to reduce
                                this to save disk space for enormous datasets
                                [default: 1000]
  --ultra_sensitive             Runs phold with maximum sensitivity by
                                skipping Foldseek prefilter. Not recommended
                                for large datasets.
  --extra_foldseek_params TEXT  Extra foldseek search params
  --custom_db TEXT              Path to custom database
  --foldseek_gpu                Use this to enable compatibility with
                                Foldseek-GPU search acceleration
  --restart                     Use this to restart phold from 'Processing
                                Foldseek output' after foldseek_results.tsv is
                                generated

phold run

phold run runs phold predict and phold compare together in one command. It is recommended if you are running phold on a local workstation with an available GPU. If you have an NVIDIA GPU, --foldseek_gpu is recommended to accelerate Foldseek.

It is also recommended if you are running phold in a low-resource environment without a GPU (e.g. a laptop), in which case you will also need to specify --cpu.

Example usage where NVIDIA GPU is available:

phold run -i pharokka.gbk -o phold_output -t 8 --foldseek_gpu

Usage: phold run [OPTIONS]

  phold predict then comapare all in one - GPU recommended

Options:
  -h, --help                     Show this message and exit.
  -V, --version                  Show the version and exit.
  -i, --input PATH               Path to input file in Genbank format or
                                 nucleotide FASTA format  [required]
  -o, --output PATH              Output directory   [default: output_phold]
  -t, --threads INTEGER          Number of threads  [default: 1]
  -p, --prefix TEXT              Prefix for output files  [default: phold]
  -d, --database TEXT            Specific path to installed phold database
  -f, --force                    Force overwrites the output directory
  --autotune                     Run autotuning to detect and automatically
                                 use best batch size for your hardware.
                                 Recommended only if you have a large dataset
                                 (e.g. thousands of proteins), or else
                                 autotuning will add rather than save runtime.
  --batch_size INTEGER           batch size for ProstT5.  [default: 1]
  --cpu                          Use cpus only.
  --omit_probs                   Do not output per residue 3Di probabilities
                                 from ProstT5. Mean per protein 3Di
                                 probabilities will always be output.
  --save_per_residue_embeddings  Save the ProstT5 embeddings per resuide in a
                                 h5 file
  --save_per_protein_embeddings  Save the ProstT5 embeddings as means per
                                 protein in a h5 file
  --mask_threshold FLOAT         Masks 3Di residues below this value of
                                 ProstT5 confidence for Foldseek searches
                                 [default: 25]
  --finetune                     Use gbouras13/ProstT5Phold encoder + CNN
                                 model both finetuned on phage proteins
  --vanilla                      Use vanilla CNN model (trained on CASP14)
                                 with ProstT5Phold encoder instead of the one
                                 trained on phage proteins
  --hyps                         Use this to only annotate hypothetical
                                 proteins from a Pharokka GenBank input
  -e, --evalue FLOAT             Evalue threshold for Foldseek  [default:
                                 1e-3]
  -s, --sensitivity FLOAT        Sensitivity parameter for foldseek  [default:
                                 9.5]
  --keep_tmp_files               Keep temporary intermediate files,
                                 particularly the large foldseek_results.tsv
                                 of all Foldseek hits
  --card_vfdb_evalue FLOAT       Stricter E-value threshold for Foldseek CARD
                                 and VFDB hits  [default: 1e-10]
  --separate                     Output separate GenBank files for each contig
  --max_seqs INTEGER             Maximum results per query sequence allowed to
                                 pass the prefilter. You may want to reduce
                                 this to save disk space for enormous datasets
                                 [default: 1000]
  --ultra_sensitive              Runs phold with maximum sensitivity by
                                 skipping Foldseek prefilter. Not recommended
                                 for large datasets.
  --extra_foldseek_params TEXT   Extra foldseek search params
  --custom_db TEXT               Path to custom database
  --foldseek_gpu                 Use this to enable compatibility with
                                 Foldseek-GPU search acceleration
  --restart                      Use this to restart phold from 'Processing
                                 Foldseek output' after foldseek_results.tsv
                                 is generated

phold proteins-predict

Identical to phold predict, but instead takes a FASTA input file of amino acid protein sequences. Useful for bulk annotation of phage proteins.

Example usage

phold proteins-predict -i phage_proteins.faa -o phold_proteins_predict_output

Usage: phold proteins-predict [OPTIONS]

  Runs ProstT5 on a multiFASTA input - GPU recommended

Options:
  -h, --help                     Show this message and exit.
  -V, --version                  Show the version and exit.
  -i, --input PATH               Path to input multiFASTA file  [required]
  -o, --output PATH              Output directory   [default: output_phold]
  -t, --threads INTEGER          Number of threads  [default: 1]
  -p, --prefix TEXT              Prefix for output files  [default: phold]
  -d, --database TEXT            Specific path to installed phold database
  -f, --force                    Force overwrites the output directory
  --autotune                     Run autotuning to detect and automatically
                                 use best batch size for your hardware.
                                 Recommended only if you have a large dataset
                                 (e.g. thousands of proteins), or else
                                 autotuning will add rather than save runtime.
  --batch_size INTEGER           batch size for ProstT5.  [default: 1]
  --cpu                          Use cpus only.
  --omit_probs                   Do not output per residue 3Di probabilities
                                 from ProstT5. Mean per protein 3Di
                                 probabilities will always be output.
  --save_per_residue_embeddings  Save the ProstT5 embeddings per resuide in a
                                 h5 file
  --save_per_protein_embeddings  Save the ProstT5 embeddings as means per
                                 protein in a h5 file
  --mask_threshold FLOAT         Masks 3Di residues below this value of
                                 ProstT5 confidence for Foldseek searches
                                 [default: 25]
  --finetune                     Use gbouras13/ProstT5Phold encoder + CNN
                                 model both finetuned on phage proteins
  --vanilla                      Use vanilla CNN model (trained on CASP14)
                                 with ProstT5Phold encoder instead of the one
                                 trained on phage proteins
  --hyps                         Use this to only annotate hypothetical
                                 proteins from a Pharokka GenBank input

phold proteins-compare

Identical to phold compare, but instead takes a FASTA input file of amino acid protein sequences. Useful for bulk annotation of phage proteins.

Example usage

phold proteins-compare -i phage_proteins.faa --predictions_dir phold_proteins_predict_output -o phold_proteins_compare_output -t 8

Usage: phold proteins-compare [OPTIONS]

  Runs Foldseek vs phold db on proteins input

Options:
  -h, --help                    Show this message and exit.
  -V, --version                 Show the version and exit.
  -i, --input PATH              Path to input file in multiFASTA format
                                [required]
  --predictions_dir PATH        Path to output directory from phold proteins-
                                predict
  --structures                  Use if you have .pdb or .cif file structures
                                for the input proteins (e.g. with
                                AF2/Colabfold) in a directory that you specify
                                with --structure_dir
  --structure_dir PATH          Path to directory with .pdb or .cif file
                                structures. The CDS IDs need to be in the name
                                of the file
  --filter_structures           Flag that creates a copy of the .pdb or .cif
                                files structures with matching record IDs
                                found in the input GenBank file. Helpful if
                                you have a directory with lots of .pdb files
                                and want to annotate only e.g. 1 phage.
  -o, --output PATH             Output directory   [default: output_phold]
  -t, --threads INTEGER         Number of threads  [default: 1]
  -p, --prefix TEXT             Prefix for output files  [default: phold]
  -d, --database TEXT           Specific path to installed phold database
  -f, --force                   Force overwrites the output directory
  -e, --evalue FLOAT            Evalue threshold for Foldseek  [default: 1e-3]
  -s, --sensitivity FLOAT       Sensitivity parameter for foldseek  [default:
                                9.5]
  --keep_tmp_files              Keep temporary intermediate files,
                                particularly the large foldseek_results.tsv of
                                all Foldseek hits
  --card_vfdb_evalue FLOAT      Stricter E-value threshold for Foldseek CARD
                                and VFDB hits  [default: 1e-10]
  --separate                    Output separate GenBank files for each contig
  --max_seqs INTEGER            Maximum results per query sequence allowed to
                                pass the prefilter. You may want to reduce
                                this to save disk space for enormous datasets
                                [default: 1000]
  --ultra_sensitive             Runs phold with maximum sensitivity by
                                skipping Foldseek prefilter. Not recommended
                                for large datasets.
  --extra_foldseek_params TEXT  Extra foldseek search params
  --custom_db TEXT              Path to custom database
  --foldseek_gpu                Use this to enable compatibility with
                                Foldseek-GPU search acceleration
  --restart                     Use this to restart phold from 'Processing
                                Foldseek output' after foldseek_results.tsv is
                                generated

phold remote

This command queries the Foldseek webserver to predict the 3Di sequence instead of running ProstT5 locally, followed by a local Foldseek search against the Phold database. This is recommended for users with extremely low compute (such as an old laptop) or who can't get ProstT5 to run on their machine.

I would recommend our Colab notebook instead to be honest.

Example usage

phold remote -i pharokka.gbk -o phold_remote_output -t 8

Usage: phold remote [OPTIONS]

  Uses Foldseek API to run ProstT5 then Foldseek locally

Options:
  -h, --help                    Show this message and exit.
  -V, --version                 Show the version and exit.
  -i, --input PATH              Path to input file in Genbank format or
                                nucleotide FASTA format  [required]
  -o, --output PATH             Output directory   [default: output_phold]
  -t, --threads INTEGER         Number of threads  [default: 1]
  -p, --prefix TEXT             Prefix for output files  [default: phold]
  -d, --database TEXT           Specific path to installed phold database
  -f, --force                   Force overwrites the output directory
  -e, --evalue FLOAT            Evalue threshold for Foldseek  [default: 1e-3]
  -s, --sensitivity FLOAT       Sensitivity parameter for foldseek  [default:
                                9.5]
  --keep_tmp_files              Keep temporary intermediate files,
                                particularly the large foldseek_results.tsv of
                                all Foldseek hits
  --card_vfdb_evalue FLOAT      Stricter E-value threshold for Foldseek CARD
                                and VFDB hits  [default: 1e-10]
  --separate                    Output separate GenBank files for each contig
  --max_seqs INTEGER            Maximum results per query sequence allowed to
                                pass the prefilter. You may want to reduce
                                this to save disk space for enormous datasets
                                [default: 1000]
  --ultra_sensitive             Runs phold with maximum sensitivity by
                                skipping Foldseek prefilter. Not recommended
                                for large datasets.
  --extra_foldseek_params TEXT  Extra foldseek search params
  --custom_db TEXT              Path to custom database
  --foldseek_gpu                Use this to enable compatibility with
                                Foldseek-GPU search acceleration

phold createdb

This is an auxiliary command that allows you to create a Foldseek-compatible database from AA and 3Di protein sequences (such as those created by phold predict).

Example usage

phold createdb --fasta_aa phold_aa.fasta --fasta_3di phold_3di.fasta -o my_foldseek_db

Usage: phold createdb [OPTIONS]

  Creates foldseek DB from AA FASTA and 3Di FASTA input files

Options:
  -h, --help             Show this message and exit.
  -V, --version          Show the version and exit.
  --fasta_aa PATH        Path to input Amino Acid FASTA file of proteins
                         [required]
  --fasta_3di PATH       Path to input 3Di FASTA file of proteins  [required]
  -o, --output PATH      Output directory   [default:
                         output_phold_foldseek_db]
  -t, --threads INTEGER  Number of threads to use with Foldseek  [default: 1]
  -p, --prefix TEXT      Prefix for Foldseek database  [default:
                         phold_foldseek_db]
  -f, --force            Force overwrites the output directory

phold plot

This is an auxiliary command that allows you to create Circos plots for your phage(s) with pyCirclize. It requires only the phold GenBank file as its input and will output .png and .svg format files.

If you have annotated more than 1 contig with phold, phold plot will automatically plot them all in separate files. The contig ids will be the prefix of the output plot file names.

Example usage

phold plot -i phold.gbk -o phold_plots

Usage: phold plot [OPTIONS]

  Creates Phold Circular Genome Plots

Options:
  -h, --help                      Show this message and exit.
  -V, --version                   Show the version and exit.
  -i, --input PATH                Path to input file in Genbank format (in the
                                  phold output directory)  [required]
  -o, --output PATH               Output directory to store phold plots
                                  [default: phold_plots]
  -p, --prefix TEXT               Prefix for output files. Needs to match what
                                  phold was run with.  [default: phold]
  -f, --force                     Force overwrites the output directory
  -a, --all                       Plot every contig.
  -t, --plot_title TEXT           Plot title. Only applies if --all is not
                                  specified. Will default to the phage\'s
                                  contig id.
  --label_hypotheticals           Flag to label hypothetical or unknown
                                  proteins. By default these are not labelled
  --remove_other_features_labels  Flag to remove labels for
                                  tRNA/tmRNA/CRISPRs. By default these are
                                  labelled.  They will still be plotted in
                                  black
  --title_size FLOAT              Controls title size. Must be an integer.
                                  Defaults to 20
  --label_size INTEGER            Controls annotation label size. Must be an
                                  integer. Defaults to 8
  --interval INTEGER              Axis tick interval. Must be an integer. Must
                                  be an integer. Defaults to 5000.
  --truncate INTEGER              Number of characters to include in annoation
                                  labels before truncation with ellipsis.
                                  Must be an integer. Defaults to 20.
  --dpi INTEGER                   Resultion \(dots per inch\). Must be an
                                  integer. Defaults to 600.
  --annotations FLOAT             Controls the proporition of annotations
                                  labelled. Must be a proportion between 0 and
                                  1 inclusive.  0 = no annotations, 0.5 = half
                                  of the annotations, 1 = all annotations.
                                  Defaults to 1. Chosen in order of CDS size.
  --label_ids TEXT                Text file with list of CDS IDs \(from gff
                                  file\) that are guaranteed to be labelled.
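
Several of the options above can be combined, as in the sketch below. The CDS IDs and the output directory name are hypothetical placeholders, and the one-ID-per-line layout of the --label_ids file is an assumption. The phold call is guarded so that the snippet does nothing harmful on a machine where phold is not installed.

```shell
# Hypothetical CDS IDs, one per line (assumed file layout);
# replace them with IDs from your own phold output.
printf '%s\n' "CDS_0001" "CDS_0002" > label_ids.txt

# Label half of the annotations (chosen in order of CDS size), always label
# the IDs listed above, and lower the resolution from the default 600 dpi.
if command -v phold >/dev/null 2>&1; then
    phold plot -i phold.gbk -o phold_plots_half \
        --annotations 0.5 --label_ids label_ids.txt --dpi 300
fi
```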

phold autotune

This is an auxiliary command that allows you to detect the most efficient batch size for your hardware.

It works by running ProstT5 on a variety of batch sizes, and it will print the best batch size to screen.

The same functionality is available with phold run, phold predict and phold proteins-predict using --autotune (though with fewer parameters to tweak than phold autotune). If you use --autotune with these subcommands, the best batch size is then used automatically.

By default, phold autotune does not take an input file. It uses a sample of up to 5000 proteins from the Phold DB 1.36M.

You can still specify a custom proteins .faa input to sample from using -i (in case yours is extremely different to Phold DB).

Example usage

phold autotune -d phold_db --min_batch 1 --step 20 --max_batch 1001 --sample_seqs 1000

Usage: phold autotune [OPTIONS]

  Determines optimal batch size for 3Di prediction with your hardware

Options:
  -h, --help             Show this message and exit.
  -V, --version          Show the version and exit.
  -i, --input PATH       Optional path to input file of proteins if you do not
                         want to use the default sample of 5000 Phold DB
                         proteins
  --cpu                  Use cpus only.
  -t, --threads INTEGER  Number of threads  [default: 1]
  -d, --database TEXT    Specific path to installed phold database
  --min_batch INTEGER    Minimum batch size to test  [default: 1]
  --step INTEGER         Controls batch size step increment  [default: 10]
  --max_batch INTEGER    Maximum batch size to test  [default: 251]
  --sample_seqs INTEGER  Number of proteins to subsample from input.
                         [default: 500]
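
Since phold autotune only prints the best batch size, you pass it on yourself in later runs with --batch_size. A minimal sketch, where the value 21 is purely a placeholder and not a recommendation:

```shell
# Placeholder: replace 21 with the batch size that phold autotune printed.
BATCH_SIZE=21

if command -v phold >/dev/null 2>&1; then
    phold predict -i pharokka.gbk -o phold_predict_output --batch_size "$BATCH_SIZE"
else
    echo "phold not found on PATH" >&2
fi
```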