phold Logo

phold is a sensitive annotation tool for bacteriophage genomes and metagenomes using protein structural homology.

phold uses the ProstT5 protein language model to rapidly translate protein amino acid sequences to the 3Di token alphabet used by Foldseek. Foldseek then searches these 3Di sequences against a database of over 1.36 million phage protein structures, most of them predicted using ColabFold.

Alternatively, if you have pre-computed protein structures for your phage(s), you can supply them to phold compare instead of using ProstT5.
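The two routes above (ProstT5 prediction followed by a Foldseek comparison, or a comparison against pre-computed structures) can be sketched as command lines assembled in Python. This is a minimal sketch, not the definitive interface: the file and directory names are hypothetical, and the flags (`-i`, `-o`, `-t`, `--predictions_dir`, `--structures`, `--structure_dir`) are assumptions about the phold CLI that should be checked against `phold predict -h` and `phold compare -h` for your installed version.

```python
from typing import List, Optional


def phold_predict_cmd(genbank: str, predictions_dir: str) -> List[str]:
    """Command that (assumed) runs ProstT5 to predict 3Di sequences."""
    return ["phold", "predict", "-i", genbank, "-o", predictions_dir]


def phold_compare_cmd(
    genbank: str,
    out_dir: str,
    predictions_dir: Optional[str] = None,
    structure_dir: Optional[str] = None,
    threads: int = 8,
) -> List[str]:
    """Command that (assumed) runs the Foldseek comparison step.

    Exactly one of predictions_dir (ProstT5 output) or structure_dir
    (pre-computed structures) must be given.
    """
    if (predictions_dir is None) == (structure_dir is None):
        raise ValueError("give exactly one of predictions_dir or structure_dir")
    cmd = ["phold", "compare", "-i", genbank, "-o", out_dir, "-t", str(threads)]
    if predictions_dir is not None:
        # Assumed flag: reuse the ProstT5 predictions from `phold predict`.
        cmd += ["--predictions_dir", predictions_dir]
    else:
        # Assumed flags: skip ProstT5 and use pre-computed structures.
        cmd += ["--structures", "--structure_dir", structure_dir]
    return cmd


if __name__ == "__main__":
    # Hypothetical input file and output directories.
    print(" ".join(phold_predict_cmd("my_phage.gbk", "predictions")))
    print(" ".join(phold_compare_cmd("my_phage.gbk", "phold_out",
                                     predictions_dir="predictions")))
    print(" ".join(phold_compare_cmd("my_phage.gbk", "phold_out_structures",
                                     structure_dir="my_structures")))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once phold is installed. Building the argument list separately from running it keeps the sketch easy to inspect before anything is executed.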

The phold database consists of over 1.36 million protein structures generated using ColabFold and ESMFold from the following databases:

  • PHROGs - 441,177 de-duplicated proteins. Proteins over 3000 AA were chunked into equal components such that each fragment was under 3000 AA.
  • enVhogs - 562,369 proteins from the ~2.2M ENVHOGs <3000 AA that were assigned a PHROG.
  • efam - 233,181 efam "extra conservative" proteins that were assigned a PHROG.
  • DGRs - 12,683 extra diversity-generating element proteins from Roux et al. 2021.
  • VFDB - 27,823 structures of putative bacterial virulence factors from the VFDB.
  • CARD - 4,804 structures of antibiotic resistance proteins from CARD.
  • acrDB - 3,652 anti-CRISPR proteins predicted in this study.
  • Defensefinder - 408 monomer prokaryotic defense proteins.
  • Netflax - 7,152 toxin-antitoxin proteins.
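The chunking of long PHROG proteins mentioned in the list above can be illustrated with a short sketch. It is not the code used to build the phold database: the function name `chunk_protein` and the exact splitting rule (the fewest near-equal pieces that each stay under the limit) are assumptions made for illustration.

```python
import math
from typing import List

MAX_LEN = 3000  # length threshold stated in the text (amino acids)


def chunk_protein(seq: str, max_len: int = MAX_LEN) -> List[str]:
    """Split a protein into near-equal fragments, each shorter than max_len.

    Sequences of max_len residues or fewer are returned unchanged.
    """
    n = len(seq)
    if n <= max_len:
        return [seq]
    # Fewest chunks such that the largest chunk has at most max_len - 1 residues.
    n_chunks = math.ceil(n / (max_len - 1))
    base, extra = divmod(n, n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional residue.
        size = base + 1 if i < extra else base
        chunks.append(seq[start:start + size])
        start += size
    return chunks


if __name__ == "__main__":
    fragments = chunk_protein("A" * 7000)
    print([len(f) for f in fragments])
```

Under these assumptions, fragment lengths differ by at most one residue and concatenating the fragments recovers the original sequence.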

Google Colab Notebook

If you don't want to install phold locally, you can run it without any code using this Google Colab notebook.

Pharokka, Phold and Phynteny are complementary tools, and when used together they substantially increase the annotation rate of your phage genome. The plot below shows the annotation rate of different tools across 4 benchmarked datasets ((a) INPHARED 1419, (b) Cook, (c) Crass and (d) Tara; see the Phold preprint for more information).

Specifically, the final Phynteny plots combine the benefits of annotation with Pharokka (with HMM, the second violin), followed by Phold (with structures, the fourth violin), followed by Phynteny.

Figure: pharokka plus phold plus phynteny