phold Method
ProstT5 3Di Inference
phold begins by predicting the Foldseek 3Di tokens for every input protein using the ProstT5 protein language model. Alternatively, this step is skipped if you run phold compare and specify pre-computed protein structures in .pdb or .cif format using the --structures flag, along with the directory containing the structures using --structure_dir.
Foldseek Structural Comparison
phold then creates a Foldseek database combining the AA and 3Di representations of each protein, and compares this to the phold database with Foldseek. Alternatively, instead of using ProstT5, you can supply protein structures that you have pre-computed for your phage(s) with the parameters --structures and --structure_dir.
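As a rough illustration of the pre-computed structures route, the sketch below assembles a phold compare command line in Python. Only --structures and --structure_dir come from the text above; the --input and --output options, the helper name build_compare_command, and all paths are assumptions made for the example and should be checked against phold compare --help.

```python
import shlex
from typing import List


def build_compare_command(input_path: str, output_dir: str, structure_dir: str) -> List[str]:
    """Assemble an argument list for `phold compare` with pre-computed structures.

    --input and --output are assumed option names; --structures and
    --structure_dir are the flags named in the text.
    """
    return [
        "phold", "compare",
        "--input", input_path,             # assumed option name
        "--output", output_dir,            # assumed option name
        "--structures",                    # use structures instead of ProstT5 3Di
        "--structure_dir", structure_dir,  # directory holding the .pdb/.cif files
    ]


# Hypothetical paths, for illustration only.
cmd = build_compare_command("phage.gbk", "phold_out", "my_structures")
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) once the option names have been confirmed for the installed version.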
Downstream Annotation Processing
- For each protein, the following logic is applied to select the top hit (if the input is not a Pharokka GenBank file, the Pharokka steps are skipped):
  - If there is at least one Foldseek hit with a non-hypothetical function, the top hit is the one with the lowest e-value.
  - Otherwise, if a Pharokka hit with a non-hypothetical function exists in the GenBank input, that is taken as the top hit.
  - Otherwise, if only hypothetical proteins are found as Foldseek hits, the top hit is the one with the lowest e-value (this PHROG's function may be discovered in the future).
  - Otherwise, if there are no Foldseek hits and only a hypothetical Pharokka hit exists in the GenBank input, that is taken.
  - Otherwise, if there are no Foldseek hits and no Pharokka hit, the protein is labelled ‘No_PHROG’ and hypothetical protein.
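The selection cascade above can be sketched in Python as follows. This is a minimal illustration of the described order of precedence, not phold's actual implementation: the Hit type, its field names, and the select_top_hit function are hypothetical, and a hit is treated as hypothetical when its function string equals "hypothetical protein".

```python
from typing import NamedTuple, Optional, Sequence

HYPOTHETICAL = "hypothetical protein"


class Hit(NamedTuple):
    """Hypothetical record for a Foldseek or Pharokka hit."""
    phrog: str            # PHROG identifier
    function: str         # annotated function, or "hypothetical protein"
    evalue: float = 0.0   # Foldseek e-value (not used for Pharokka hits)


def select_top_hit(foldseek_hits: Sequence[Hit],
                   pharokka_hit: Optional[Hit] = None) -> Hit:
    """Apply the five rules in order; pharokka_hit is None for non-Pharokka input."""
    # 1. Any Foldseek hit with a known function: lowest e-value wins.
    known = [h for h in foldseek_hits if h.function != HYPOTHETICAL]
    if known:
        return min(known, key=lambda h: h.evalue)
    # 2. Otherwise, a Pharokka hit with a known function.
    if pharokka_hit is not None and pharokka_hit.function != HYPOTHETICAL:
        return pharokka_hit
    # 3. Otherwise, the best hypothetical Foldseek hit.
    if foldseek_hits:
        return min(foldseek_hits, key=lambda h: h.evalue)
    # 4. Otherwise, a hypothetical Pharokka hit.
    if pharokka_hit is not None:
        return pharokka_hit
    # 5. Nothing found at all.
    return Hit(phrog="No_PHROG", function=HYPOTHETICAL)
```

Note that under rule 1 a hypothetical Foldseek hit with a better e-value is passed over in favour of any hit with a known function.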
phold search database
The phold database consists of over 1.36 million protein structures generated using ColabFold and ESMFold from the following databases:
- PHROGs - 441,177 de-duplicated proteins. Proteins over 3000 AA were split into equal-sized chunks such that each fragment was under 3000 AA.
- enVhogs - 562,369 proteins from the ~2.2M enVhogs under 3000 AA that were assigned a PHROG.
- efam - 233,181 efam "extra conservative" proteins that were assigned a PHROG.
- DGRs - 12,683 extra diversity-generating element proteins from Roux et al. 2021.
- VFDB - 27,823 structures of putative bacterial virulence factors from the VFDB.
- CARD - 4,804 structures of antibiotic resistance proteins from CARD.
- acrDB - 3,652 anti-CRISPR proteins predicted in this study.
- Defensefinder - 408 monomer prokaryotic defense proteins.
- Netflax - 7,152 toxin-antitoxin proteins.
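The chunking of long PHROG proteins mentioned in the list above could look roughly like the sketch below. It is one plausible reading of "equal components", assuming the smallest number of chunks that keeps every fragment under 3000 AA, with chunk lengths differing by at most one residue. The function name chunk_protein and this exact splitting rule are assumptions, not phold's documented procedure.

```python
import math
from typing import List

MAX_LEN = 3000  # length threshold in amino acids, taken from the text


def chunk_protein(seq: str, max_len: int = MAX_LEN) -> List[str]:
    """Split a protein longer than max_len into near-equal chunks shorter than max_len."""
    if len(seq) <= max_len:
        return [seq]  # only proteins over the threshold are chunked
    # Smallest chunk count such that the longest chunk is at most max_len - 1.
    n_chunks = math.ceil(len(seq) / (max_len - 1))
    base, extra = divmod(len(seq), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks carry one additional residue.
        size = base + (1 if i < extra else 0)
        chunks.append(seq[start:start + size])
        start += size
    return chunks


# Example with a synthetic 7000-residue sequence.
parts = chunk_protein("A" * 7000)
print([len(p) for p in parts])
```

Concatenating the returned chunks in order reproduces the original sequence, so no residues are dropped or duplicated at chunk boundaries.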