phold is a sensitive annotation tool for bacteriophage genomes and metagenomes that uses protein structural homology.

phold uses the ProstT5 protein language model to translate protein amino acid sequences into the 3Di token alphabet used by Foldseek. Foldseek then searches these 3Di sequences against a database of approximately 803k protein structures, most of them predicted using ColabFold.

Alternatively, you can specify protein structures that you have pre-computed for your phage(s) instead of using ProstT5.
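For scripted use, the default ProstT5 route described above can be wrapped in a small Python helper. This is a minimal sketch, not part of phold itself: the helper names `build_phold_command` and `run_phold` are hypothetical, and the `phold run` subcommand with its `-i`, `-o` and `-t` flags is assumed here, so check `phold run -h` for the options your installed version actually accepts.

```python
import shutil
import subprocess
from pathlib import Path


def build_phold_command(genbank: Path, outdir: Path, threads: int = 8) -> list[str]:
    """Assemble the argument list for a `phold run` call.

    Assumption: `phold run` takes -i (input), -o (output directory) and
    -t (threads). Verify against `phold run -h` before relying on this.
    """
    return ["phold", "run", "-i", str(genbank), "-o", str(outdir), "-t", str(threads)]


def run_phold(genbank: Path, outdir: Path, threads: int = 8) -> None:
    """Run phold if it is installed; otherwise fail with a clear message."""
    if shutil.which("phold") is None:
        raise RuntimeError("phold was not found on PATH")
    # check=True raises CalledProcessError if phold exits with a non-zero status
    subprocess.run(build_phold_command(genbank, outdir, threads), check=True)
```

Calling `run_phold(Path("my_phage.gbk"), Path("phold_output"))` would then launch the annotation; the file and directory names are placeholders.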

The phold database consists of approximately 803k protein structures generated using ColabFold from the following databases:

  • PHROGs - 440549 de-duplicated proteins. Proteins over 1000AA were chunked into 1000AA components.
  • ENVHOGs - 315k representative proteins from the 2.2M ENVHOGs that have a PHROG function that is not 'unknown function'. Proteins over 1000AA were chunked into 1000AA components.
  • VFDB - over 28k structures of putative bacterial virulence factors from the VFDB.
  • CARD - nearly 5k structures of antibiotic resistance proteins from CARD.
  • acrDB - nearly 3.7k anti-CRISPR proteins predicted in this study.
  • Defensefinder - 455 monomeric prokaryotic defense proteins.
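
The chunking of long PHROGs and ENVHOGs proteins can be illustrated with a short Python sketch. It assumes consecutive, non-overlapping 1000AA windows with a shorter final chunk; the text above does not state how the remainder or any overlap was handled, so treat this as one plausible reading rather than the exact procedure used to build the database. The function name `chunk_protein` is hypothetical.

```python
def chunk_protein(sequence: str, chunk_size: int = 1000) -> list[str]:
    """Split a protein sequence into consecutive, non-overlapping chunks.

    Assumption: sequences at or below chunk_size are kept whole, and the
    last chunk of a longer sequence holds whatever residues remain.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(sequence) <= chunk_size:
        return [sequence]
    return [
        sequence[start:start + chunk_size]
        for start in range(0, len(sequence), chunk_size)
    ]
```

Under these assumptions, a 2500AA protein would be split into components of 1000, 1000 and 500 residues.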

Google Colab Notebooks

If you don't want to install phold locally, you can run it without any code using one of the following Google Colab notebooks: