Once boltz is installed, you can run predictions with:
boltz predict <INPUT_PATH> [OPTIONS]
<INPUT_PATH> can be either a single .yaml or .fasta file (YAML is preferred; FASTA is deprecated), or a directory, in which case predictions will be run on all .yaml and .fasta files inside.
- If you include --use_msa_server, the MSA will be generated automatically via the mmseqs2 server. Without this flag, you must provide a pre-computed MSA.
- If you include --use_potentials, Boltz will apply inference-time potentials to improve the physical plausibility of the predicted poses.
- Without the --override option, Boltz will try to use the cached preprocessed files and existing predictions, if any are present in your output directory (name of your input by default). Add the --override flag to run the prediction from scratch, e.g. if you change some parameters or complex details without changing the output directory.
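The flags above can be combined programmatically. The following is a minimal sketch, assuming you want to launch Boltz from a Python script; build_predict_command is a hypothetical helper (not part of Boltz) that only assembles the argument list from the flags documented above.

```python
from typing import List, Optional


def build_predict_command(
    input_path: str,
    use_msa_server: bool = False,
    use_potentials: bool = False,
    override: bool = False,
    out_dir: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for `boltz predict` (hypothetical helper)."""
    cmd = ["boltz", "predict", input_path]
    if out_dir is not None:
        cmd += ["--out_dir", out_dir]
    if use_msa_server:
        # MSA is generated by the mmseqs2 server instead of being supplied.
        cmd.append("--use_msa_server")
    if use_potentials:
        # Apply inference-time potentials to the predicted poses.
        cmd.append("--use_potentials")
    if override:
        # Ignore cached preprocessed files and existing predictions.
        cmd.append("--override")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_predict_command("complex.yaml", use_msa_server=True)))
```

The returned list can be passed to subprocess.run(cmd, check=True) once Boltz is installed.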
Boltz takes inputs in .yaml format, which specifies the components of the complex.
Below is the full schema (each section is described in detail afterward):
sequences:
    - ENTITY_TYPE:
        id: CHAIN_ID
        sequence: SEQUENCE # only for protein, dna, rna
        smiles: 'SMILES' # only for ligand, exclusive with ccd
        ccd: CCD # only for ligand, exclusive with smiles
        msa: MSA_PATH # only for protein
        modifications:
          - position: RES_IDX # index of residue, starting from 1
            ccd: CCD # CCD code of the modified residue
        cyclic: false
    - ENTITY_TYPE:
        id: [CHAIN_ID, CHAIN_ID] # multiple ids in case of multiple identical entities
        ...
constraints:
    - bond:
        atom1: [CHAIN_ID, RES_IDX, ATOM_NAME]
        atom2: [CHAIN_ID, RES_IDX, ATOM_NAME]
    - pocket:
        binder: CHAIN_ID
        contacts: [[CHAIN_ID, RES_IDX/ATOM_NAME], [CHAIN_ID, RES_IDX/ATOM_NAME]]
        max_distance: DIST_ANGSTROM
        force: false # if force is set to true (default is false), a potential will be used to enforce the pocket constraint
    - contact:
        token1: [CHAIN_ID, RES_IDX/ATOM_NAME]
        token2: [CHAIN_ID, RES_IDX/ATOM_NAME]
        max_distance: DIST_ANGSTROM
        force: false # if force is set to true (default is false), a potential will be used to enforce the contact constraint
templates:
    - cif: CIF_PATH # if only a path is provided, Boltz will find the best matchings
      force: true # optional, if force is set to true (default is false), a potential will be used to enforce the template
      threshold: DISTANCE_THRESHOLD # optional, controls the distance (in Angstroms) that the prediction can deviate from the template
    - cif: CIF_PATH
      chain_id: CHAIN_ID # optional, specify which chain to find a template for
    - cif: CIF_PATH
      chain_id: [CHAIN_ID, CHAIN_ID] # can be more than one
      template_id: [TEMPLATE_CHAIN_ID, TEMPLATE_CHAIN_ID]
    - pdb: PDB_PATH # if a pdb path is provided, Boltz will incrementally assign template chain ids based on the chain names in the PDB file (A1, A2, B1, etc)
      chain_id: [CHAIN_ID, CHAIN_ID]
      template_id: [TEMPLATE_CHAIN_ID, TEMPLATE_CHAIN_ID]
properties:
    - affinity:
        binder: CHAIN_ID
The sequences section has one entry per unique chain or molecule.
- Polymers: use ENTITY_TYPE equal to protein, dna, or rna, and provide a sequence.
- Ligands (non-polymers): use ENTITY_TYPE equal to ligand, and provide either a smiles string or a ccd code (but not both).
- CHAIN_ID: unique identifier for each chain/molecule. If multiple identical entities exist, set id as a list (e.g. [A, B]).
For proteins:
- By default, an msa must be provided.
- If --use_msa_server is set, the MSA is auto-generated (so msa can be omitted).
- To use a precomputed custom MSA, set msa: MSA_PATH pointing to a .a3m file. If you have more than one protein chain, use a CSV format instead of a3m with two columns: sequence (protein sequence) and key (a unique identifier for matching rows across chains). Sequences with the same key are mutually aligned.
- To force single-sequence mode (not recommended, as it reduces accuracy), set msa: empty.
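As an illustration of the CSV-based custom MSA, here is a minimal sketch that writes one CSV per protein chain with the two columns named above. The helper write_paired_msa_csv, the column order, and the toy sequences are assumptions made for this example; only the column names sequence and key come from the description above.

```python
import csv
import io
from typing import Iterable, Tuple


def write_paired_msa_csv(rows: Iterable[Tuple[str, str]]) -> str:
    """Return CSV text with `key` and `sequence` columns (hypothetical helper).

    Each row is (key, aligned_sequence). Rows sharing a key across the
    per-chain CSV files are the ones meant to be paired with each other.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["key", "sequence"])  # column order is an assumption
    for key, sequence in rows:
        writer.writerow([key, sequence])
    return buffer.getvalue()


# Toy sequences for illustration only, not a real alignment.
chain_a_csv = write_paired_msa_csv([("0", "MVTPEG"), ("1", "MVSPEG")])
chain_b_csv = write_paired_msa_csv([("0", "GSHMKL"), ("1", "GSHMRL")])

if __name__ == "__main__":
    print(chain_a_csv)
```

Here the rows with key 0 in both files would be treated as a matching pair, and likewise for key 1.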
The modifications field is optional and allows specification of modified residues in polymers (protein, dna, or rna).
- position: index of the residue (starting from 1)
- ccd: CCD code of the modified residue (currently supported only for CCD ligands)
The cyclic flag indicates whether a polymer chain (not ligands) is cyclic.
constraints is an optional field that allows you to specify additional information about the input structure.
- The bond constraint specifies a covalent bond between two atoms (atom1 and atom2). It is currently only supported for CCD ligands and canonical residues. CHAIN_ID refers to the chain id set above, RES_IDX is the index (starting from 1) of the residue (1 for ligands), and ATOM_NAME is the standardized atom name (can be verified in the CIF file of that component on the RCSB website).
- The pocket constraint specifies the residues associated with a binding interaction, where binder refers to the chain binding to the pocket (which can be a molecule, protein, DNA or RNA) and contacts is the list of chain and residue indices (starting from 1, or atom names if the chain is a molecule) that form the binding site for the binder. max_distance specifies the maximum distance (in Angstrom, supported between 4A and 20A with 6A as default) between any atom in the binder and any atom in each of the contacts elements. If force is set to true, a potential will be used to enforce the pocket constraint.
- The contact constraint specifies a contact between two residues or atoms, where token1 and token2 are the identifiers of the residues or atoms (in the format [CHAIN_ID, RES_IDX/ATOM_NAME]). max_distance specifies the maximum distance (in Angstrom, supported between 4A and 20A with 6A as default) between any pair of atoms in the two elements. If force is set to true, a potential will be used to enforce the contact constraint.
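The pocket constraint described above can be sketched as a small builder that checks the documented distance range before the entry is written to the YAML input. The function pocket_constraint is hypothetical and exists only for this example; the field names, the 6 Angstrom default and the 4 to 20 Angstrom range are taken from the description above.

```python
from typing import Dict, List, Union

# A contact is [CHAIN_ID, RES_IDX] for polymers or [CHAIN_ID, ATOM_NAME] for molecules.
Token = List[Union[str, int]]


def pocket_constraint(
    binder: str,
    contacts: List[Token],
    max_distance: float = 6.0,
    force: bool = False,
) -> Dict[str, Dict[str, object]]:
    """Build one `pocket` entry for the `constraints` list (hypothetical helper)."""
    if not 4.0 <= max_distance <= 20.0:
        raise ValueError("max_distance must be between 4 and 20 Angstrom")
    return {
        "pocket": {
            "binder": binder,
            "contacts": contacts,
            "max_distance": max_distance,
            "force": force,  # True asks for a potential enforcing the constraint
        }
    }


if __name__ == "__main__":
    # Chain ids and residue indices below are placeholders.
    print(pocket_constraint("C", [["A", 42], ["A", 57]], max_distance=8.0))
```

The resulting dictionary mirrors the layout of a pocket entry in the schema, so it can be serialized with any YAML writer you already use.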
templates is optional and allows specification of structural templates for protein chains. At minimum, provide the path to a CIF or PDB file.
If you wish to explicitly define which of the chains in your YAML should be templated using this file, you can use the chain_id entry to specify them. If providing a PDB file, chain ids will be incrementally assigned to each subchain in a parent PDB chain resulting in template chain ids of A1, A2, B1, etc for PDB chains A and B. Make sure to look at the structure of the template PDB file to determine the corresponding value of template_id to provide. Whether a set of ids is provided or not, Boltz will find the best matching chains from the provided template. If you wish to explicitly define the mapping yourself, you may provide the corresponding template_id.
For any template you provide, you can also specify a force flag, which will use a potential to enforce that the backbone does not deviate excessively from the template during the prediction. When using force, you must also specify the threshold field, which controls the distance (in Angstroms) that the prediction can deviate from the template.
properties is an optional field that allows you to specify whether you want to compute the affinity. If enabled, you must also provide, in the binder field, the chain id of the small molecule for which the affinity will be computed. Only a single small molecule can be specified for affinity computation. It must be a ligand chain (not a protein, DNA or RNA) and can have at most 128 atoms, counting heavy atoms and the hydrogens kept by RDKit RemoveHs. However, we do not recommend running the affinity module with ligands significantly larger than 56 atoms (counted as above; this limit was set during training). At this point, Boltz only supports computing the affinity of small molecules to protein targets. If run with an RNA/DNA/co-factor target, the code will not crash but the output will be unreliable.
version: 1
sequences:
    - protein:
        id: [A, B]
        sequence: MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
        msa: ./examples/msa/seq1.a3m
    - ligand:
        id: [C, D]
        ccd: SAH
    - ligand:
        id: [E, F]
        smiles: 'N[C@@H](Cc1ccc(O)cc1)C(=O)O'
The following options are available for the predict command:
boltz predict input_path [OPTIONS]
Examples of common options include:
- Adding the --use_msa_server flag makes Boltz auto-generate the MSA using the mmseqs2 server.
- Adding the --use_potentials flag makes Boltz use an inference-time potential that significantly improves the physical quality of the poses.
- To predict a structure using 10 recycling steps and 25 samples (the default parameters for AlphaFold3), use the following options (note, however, that the prediction will take significantly longer):
  --recycling_steps 10 --diffusion_samples 25
| Option | Type | Default | Description |
|---|---|---|---|
| --out_dir | PATH | ./ | The path where to save the predictions. |
| --cache | PATH | ~/.boltz | The directory where to download the data and model. Will use environment variable BOLTZ_CACHE as an absolute path if set. |
| --checkpoint | PATH | None | An optional checkpoint. Uses the provided Boltz-2 model by default. |
| --devices | INTEGER | 1 | The number of devices to use for prediction. |
| --accelerator | [gpu,cpu,tpu] | gpu | The accelerator to use for prediction. |
| --recycling_steps | INTEGER | 3 | The number of recycling steps to use for prediction. |
| --sampling_steps | INTEGER | 200 | The number of sampling steps to use for prediction. |
| --diffusion_samples | INTEGER | 1 | The number of diffusion samples to use for prediction. |
| --max_parallel_samples | INTEGER | 5 | The maximum number of samples to predict in parallel. |
| --step_scale | FLOAT | 1.638 | The step size is related to the temperature at which the diffusion process samples the distribution. Lower values give higher diversity among samples (recommended between 1 and 2). |
| --output_format | [pdb,mmcif] | mmcif | The output format to use for the predictions. |
| --num_workers | INTEGER | 2 | The number of dataloader workers to use for prediction. |
| --method | str | None | The method to use for prediction. |
| --preprocessing-threads | INTEGER | multiprocessing.cpu_count() | The number of threads to use for preprocessing. |
| --affinity_mw_correction | FLAG | False | Whether to add the Molecular Weight correction to the affinity value head. |
| --sampling_steps_affinity | INTEGER | 200 | The number of sampling steps to use for affinity prediction. |
| --diffusion_samples_affinity | INTEGER | 5 | The number of diffusion samples to use for affinity prediction. |
| --affinity_checkpoint | PATH | None | An optional checkpoint for affinity. Uses the provided Boltz-2 model by default. |
| --max_msa_seqs | INTEGER | 8192 | The maximum number of MSA sequences to use for prediction. |
| --subsample_msa | FLAG | False | Whether to subsample the MSA. |
| --num_subsampled_msa | INTEGER | 1024 | The number of MSA sequences to subsample. |
| --no_kernels | FLAG | False | Whether to not use trifast kernels for triangular updates. |
| --override | FLAG | False | Whether to override existing predictions if found. |
| --use_msa_server | FLAG | False | Whether to use the MSA server to generate MSAs. |
| --msa_server_url | str | https://api.colabfold.com | MSA server URL. Used only if --use_msa_server is set. |
| --msa_pairing_strategy | str | greedy | Pairing strategy to use. Used only if --use_msa_server is set. Options are 'greedy' and 'complete'. |
| --use_potentials | FLAG | False | Whether to run the original Boltz-2 model using inference time potentials. |
| --write_full_pae | FLAG | False | Whether to save the full PAE matrix as a file. |
| --write_full_pde | FLAG | False | Whether to save the full PDE matrix as a file. |
After running the model, the generated outputs are organized into the output directory following the structure below:
out_dir/
├── lightning_logs/ # Logs generated during training or evaluation
├── predictions/ # Contains the model's predictions
    ├── [input_file1]/
        ├── [input_file1]_model_0.cif # The predicted structure in CIF format, with the inclusion of per token pLDDT scores
        ├── confidence_[input_file1]_model_0.json # The confidence scores (confidence_score, ptm, iptm, ligand_iptm, protein_iptm, complex_plddt, complex_iplddt, chains_ptm, pair_chains_iptm)
        ├── affinity_[input_file1].json # The affinity scores (affinity_pred_value, affinity_probability_binary, affinity_pred_value1, affinity_probability_binary1, affinity_pred_value2, affinity_probability_binary2)
        ├── pae_[input_file1]_model_0.npz # The predicted PAE score for every pair of tokens
        ├── pde_[input_file1]_model_0.npz # The predicted PDE score for every pair of tokens
        ├── plddt_[input_file1]_model_0.npz # The predicted pLDDT score for every token
        ...
        └── [input_file1]_model_[diffusion_samples-1].cif # The predicted structure in CIF format
        ...
    └── [input_file2]/
        ...
└── processed/ # Processed data used during execution
The predictions folder contains a unique folder for each input file. The input folders contain diffusion_samples predictions saved in the output_format ordered by confidence score as well as additional files containing the predictions of the confidence model and affinity model. The processed folder contains the processed input files that the model uses during inference.
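To work with this layout from a script, a minimal sketch such as the following can collect the structure files for one input. The helper list_predicted_structures is hypothetical; it assumes only the directory and file naming shown above and that results_dir is the folder containing predictions/.

```python
from pathlib import Path
from typing import List


def list_predicted_structures(
    results_dir: Path, input_name: str, ext: str = "cif"
) -> List[Path]:
    """Return `<input_name>_model_<i>.<ext>` files sorted by model index."""
    folder = results_dir / "predictions" / input_name
    prefix = f"{input_name}_model_"
    indexed = []
    for path in folder.iterdir():
        index = path.stem[len(prefix):]
        # Keep only structure files; skip confidence/pae/pde/plddt outputs.
        if path.suffix == f".{ext}" and path.stem.startswith(prefix) and index.isdigit():
            indexed.append((int(index), path))
    return [path for _, path in sorted(indexed)]


if __name__ == "__main__":
    # "out_dir" and "my_complex" are placeholder names for this example.
    target = Path("out_dir") / "predictions" / "my_complex"
    if target.is_dir():
        for structure in list_predicted_structures(Path("out_dir"), "my_complex"):
            print(structure)
```

Because the predictions are ordered by confidence score, the first element of the returned list (model_0) is the top-ranked sample.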
Each output folder includes a confidence .json file with aggregated confidence scores for that sample. Its structure is:
{
    "confidence_score": 0.8367, # Aggregated score used to sort the predictions, corresponds to 0.8 * complex_plddt + 0.2 * iptm (ptm for single chains)
    "ptm": 0.8425, # Predicted TM score for the complex
    "iptm": 0.8225, # Predicted TM score when aggregating at the interfaces
    "ligand_iptm": 0.0, # ipTM but only aggregating at protein-ligand interfaces
    "protein_iptm": 0.8225, # ipTM but only aggregating at protein-protein interfaces
    "complex_plddt": 0.8402, # Average pLDDT score for the complex
    "complex_iplddt": 0.8241, # Average pLDDT score when upweighting interface tokens
    "complex_pde": 0.8912, # Average PDE score for the complex
    "complex_ipde": 5.1650, # Average PDE score when aggregating at interfaces
    "chains_ptm": { # Predicted TM score within each chain
        "0": 0.8533,
        "1": 0.8330
    },
    "pair_chains_iptm": { # Predicted (interface) TM score between each pair of chains
        "0": {
            "0": 0.8533,
            "1": 0.8090
        },
        "1": {
            "0": 0.8225,
            "1": 0.8330
        }
    }
}
confidence_score, ptm and plddt scores (and their interface and individual chain analogues) have a range of [0, 1], where higher values indicate higher confidence. pde scores have a unit of angstroms, where lower values indicate higher confidence.
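The aggregation rule given for confidence_score can be checked against the example values. The sketch below assumes only the formula stated in the comment (0.8 * complex_plddt + 0.2 * iptm, with ptm replacing iptm for single chains); aggregate_confidence is a hypothetical name used for illustration.

```python
import json


def aggregate_confidence(
    complex_plddt: float, iptm: float, ptm: float, single_chain: bool = False
) -> float:
    """Recompute the aggregated score from its documented components."""
    interface_term = ptm if single_chain else iptm
    return 0.8 * complex_plddt + 0.2 * interface_term


# Values copied from the example confidence file above.
example = json.loads('{"ptm": 0.8425, "iptm": 0.8225, "complex_plddt": 0.8402}')
score = aggregate_confidence(example["complex_plddt"], example["iptm"], example["ptm"])

if __name__ == "__main__":
    print(round(score, 4))  # 0.8367, matching confidence_score in the example
```
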
The output affinity .json file is organized as follows:
{
    "affinity_pred_value": 0.8367, # Predicted binding affinity from the ensemble model
    "affinity_probability_binary": 0.8425, # Predicted binding likelihood from the ensemble model
    "affinity_pred_value1": 0.8225, # Predicted binding affinity from the first model of the ensemble
    "affinity_probability_binary1": 0.0, # Predicted binding likelihood from the first model in the ensemble
    "affinity_pred_value2": 0.8225, # Predicted binding affinity from the second model of the ensemble
    "affinity_probability_binary2": 0.8402 # Predicted binding likelihood from the second model in the ensemble
}
There are two main predictions in the affinity output: affinity_pred_value and affinity_probability_binary. They are trained on largely different datasets, with different supervisions, and should be used in different contexts.
The affinity_probability_binary field should be used to distinguish binders from decoys, for example in a hit-discovery stage. Its value ranges from 0 to 1 and represents the predicted probability that the ligand is a binder.
The affinity_pred_value aims to measure the specific affinity of different binders and how this changes with small modifications of the molecule (note that this implies that it should only be used when comparing different active molecules, not inactives). This should be used in ligand optimization stages such as hit-to-lead and lead-optimization. It reports a binding affinity value as log10(IC50), derived from an IC50 measured in μM. Lower values indicate stronger predicted binding, for instance:
- IC50 of $10^{-9}$ M $\longrightarrow$ our model outputs $-3$ (strong binder)
- IC50 of $10^{-6}$ M $\longrightarrow$ our model outputs $0$ (moderate binder)
- IC50 of $10^{-4}$ M $\longrightarrow$ our model outputs $2$ (weak binder / decoy)
You can convert the model's output to pIC50 in kcal/mol by using y --> (6 - y) * 1.364 where y is the model's prediction.
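These conversions can be written out as a short sketch. It assumes only what is stated above: the prediction y is log10(IC50) with IC50 in μM, and the kcal/mol value is (6 - y) * 1.364. The function names are hypothetical.

```python
def ic50_molar(pred: float) -> float:
    """IC50 in mol/L from a prediction expressed as log10(IC50 in uM)."""
    return 10.0 ** (pred - 6.0)  # 1 uM = 1e-6 M


def pic50(pred: float) -> float:
    """pIC50 = -log10(IC50 in M) = 6 - y."""
    return 6.0 - pred


def pic50_kcal_per_mol(pred: float) -> float:
    """Conversion quoted above: (6 - y) * 1.364."""
    return (6.0 - pred) * 1.364


if __name__ == "__main__":
    for y in (-3.0, 0.0, 2.0):
        print(y, ic50_molar(y), pic50(y), pic50_kcal_per_mol(y))
```

For the three reference points listed above, the pIC50 values come out as 9, 6 and 4 respectively.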
When using the --use_msa_server option with a server that requires authentication, you can provide credentials in one of two ways:
For basic authentication:
- Use the CLI options --msa_server_username and --msa_server_password.
- Or, set the environment variables:
  - BOLTZ_MSA_USERNAME (for the username)
  - BOLTZ_MSA_PASSWORD (for the password, recommended for security)
Example:
export BOLTZ_MSA_USERNAME=myuser
export BOLTZ_MSA_PASSWORD=mypassword
boltz predict ... --use_msa_server
Or:
boltz predict ... --use_msa_server --msa_server_username myuser --msa_server_password mypassword
For API key authentication:
- Use the CLI options --api_key_header (default: X-API-Key) and --api_key_value to specify the header and value for API key authentication.
- Or, set the API key value via the environment variable MSA_API_KEY_VALUE (recommended for security).
Example using CLI:
boltz predict ... --use_msa_server --api_key_header X-API-Key --api_key_value <your-api-key>
Example using environment variable:
export MSA_API_KEY_VALUE=<your-api-key>
boltz predict ... --use_msa_server --api_key_header X-API-Key
If both the CLI option and environment variable are set, the CLI option takes precedence.
If your server expects a different header, set --api_key_header accordingly (e.g., --api_key_header X-Gravitee-Api-Key).
Note:
Only one authentication method (basic or API key) can be used at a time. If both are provided, the program will raise an error.
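The rules above (CLI value before environment variable, and only one method at a time) can be illustrated with a small sketch. This is not Boltz's implementation: resolve_msa_auth is a hypothetical function, and applying the same CLI-over-environment precedence to the username and password is an assumption made for the example.

```python
from typing import Dict, Mapping, Optional


def resolve_msa_auth(
    env: Mapping[str, str],
    username: Optional[str] = None,
    password: Optional[str] = None,
    api_key_value: Optional[str] = None,
    api_key_header: str = "X-API-Key",
) -> Dict[str, str]:
    """Pick one authentication method from CLI-style values and environment."""
    # CLI values win over environment variables (assumed for basic auth too).
    username = username or env.get("BOLTZ_MSA_USERNAME")
    password = password or env.get("BOLTZ_MSA_PASSWORD")
    api_key = api_key_value or env.get("MSA_API_KEY_VALUE")

    has_basic = bool(username and password)
    if has_basic and api_key:
        raise ValueError("use either basic or API key authentication, not both")
    if api_key:
        return {"method": "api_key", "header": api_key_header, "value": api_key}
    if has_basic:
        return {"method": "basic", "username": username, "password": password}
    return {"method": "none"}


if __name__ == "__main__":
    import os

    print(resolve_msa_auth(os.environ)["method"])
```
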
FASTA format is still supported but is deprecated and only supports a limited subset of features compared to YAML.
| Feature | Fasta | YAML |
|---|---|---|
| Polymers | ✅ | ✅ |
| Smiles | ✅ | ✅ |
| CCD code | ✅ | ✅ |
| Custom MSA | ✅ | ✅ |
| Modified Residues | ❌ | ✅ |
| Covalent bonds | ❌ | ✅ |
| Pocket conditioning | ❌ | ✅ |
| Affinity | ❌ | ✅ |
A FASTA input file contains entries as follows:
>CHAIN_ID|ENTITY_TYPE|MSA_PATH
SEQUENCE
The CHAIN_ID is a unique identifier for each input chain. The ENTITY_TYPE can be one of protein, dna, rna, smiles, ccd (note that we support both smiles and CCD code for ligands). The MSA_PATH is only applicable to proteins. By default, MSAs are required, but they can be omitted by passing the --use_msa_server flag, which will auto-generate the MSA using the mmseqs2 server. If you wish to use a custom MSA, set MSA_PATH to the path of the .a3m file containing a pre-computed MSA for this protein. If you wish to explicitly run single-sequence mode (which is generally advised against, as it will hurt model performance), you may do so by using the special keyword empty for that protein (ex: >A|protein|empty). For a custom MSA, you may wish to indicate pairing keys to the model. You can do so by using a CSV format instead of a3m with two columns: sequence with the protein sequences and key, which is a unique identifier indicating matching rows across the CSV files of each protein chain.
For each of these cases, the corresponding SEQUENCE will contain an amino acid sequence (e.g. EFKEAFSLF), a sequence of nucleotide bases (e.g. ATCG), a smiles string (e.g. CC1=CC=CC=C1), or a CCD code (e.g. ATP), depending on the entity.
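The header layout can be made concrete with a minimal parsing sketch. It assumes only the >CHAIN_ID|ENTITY_TYPE|MSA_PATH format and the entity types listed above; FastaEntry and parse_boltz_fasta are hypothetical names, and this is not the parser Boltz itself uses.

```python
from typing import List, NamedTuple, Optional

ENTITY_TYPES = {"protein", "dna", "rna", "smiles", "ccd"}


class FastaEntry(NamedTuple):
    chain_id: str
    entity_type: str
    msa_path: Optional[str]  # only meaningful for proteins
    sequence: str


def _make_entry(header: str, body: List[str]) -> FastaEntry:
    fields = header.split("|")
    if len(fields) < 2:
        raise ValueError(f"expected CHAIN_ID|ENTITY_TYPE in header: {header!r}")
    chain_id, entity_type = fields[0].strip(), fields[1].strip().lower()
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    msa_path = fields[2].strip() if len(fields) > 2 and fields[2].strip() else None
    return FastaEntry(chain_id, entity_type, msa_path, "".join(body))


def parse_boltz_fasta(text: str) -> List[FastaEntry]:
    """Split FASTA text into entries, one per '>' header line."""
    entries: List[FastaEntry] = []
    header: Optional[str] = None
    body: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                entries.append(_make_entry(header, body))
            header, body = line[1:], []
        else:
            body.append(line)
    if header is not None:
        entries.append(_make_entry(header, body))
    return entries


if __name__ == "__main__":
    demo = ">A|protein|empty\nEFKEAFSLF\n>B|ccd\nATP\n"
    for entry in parse_boltz_fasta(demo):
        print(entry)
```
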
As an example:
>A|protein|./examples/msa/seq1.a3m
MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
>B|protein|./examples/msa/seq1.a3m
MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
>C|ccd
SAH
>D|ccd
SAH
>E|smiles
N[C@@H](Cc1ccc(O)cc1)C(=O)O
>F|smiles
N[C@@H](Cc1ccc(O)cc1)C(=O)O
- When running on old NVIDIA GPUs, you may encounter an error related to the cuequivariance library. In this case, you should run the model with the --no_kernels flag, which will disable the use of the cuequivariance library and allow the model to run without it. This may result in slightly lower performance, but it will allow you to run the model on older hardware.