
ODesign-pipeline

🎉 Here we present ODesign-pipeline, a unified design pipeline for proteins, nucleic acids, and small molecules. It couples ODesign with a filtering stack based on AlphaFold3 ReFold and PyRosetta Score.

🎉 In our wet-lab validation, we performed one round of mini-protein design across eight targets; four reached picomolar affinity.

🛠️ Peptide- and aptamer-design pipelines are under development.


Installation

Step 1: Init Main Repo

git clone https://github.com/The-Institute-for-AI-Molecular-Design/ODesign-pipeline.git
cd ODesign-pipeline

Step 2: Prepare Sampling Module (ODesign)

git clone https://github.com/The-Institute-for-AI-Molecular-Design/ODesign.git
  1. create environment
conda create -n odesign python=3.10
conda activate odesign
pip install -r ./ODesign/requirements.txt -f https://data.pyg.org/whl/torch-2.3.1+cu121.html
  2. get checkpoints and required inference data
  • get checkpoints
bash ./ODesign/ckpt/get_odesign_ckpt.sh ./ODesign/ckpt
  • get required inference data

Before running inference for the first time, download components.v20240608.cif and components.v20240608.cif.rdkit_mol.pkl from Google Drive and place both files under ./ODesign/data.

Step 3: Prepare Filter Module (AlphaFold3 X PyRosetta)

  • AlphaFold3

Please refer to the AlphaFold3 Installation Guide for more details.

  • PyRosetta
pip install pyrosetta-installer
python -c 'import pyrosetta_installer; pyrosetta_installer.install_pyrosetta()'

Please refer to the PyRosetta Docs for more details.

Repo Layout

This pipeline assumes the following directory structure (relative to this repo root):

ODesign-pipeline/
├── examples/
├── filter/
├── ODesign/
│   ├── ckpt/
│   ├── data/
├── scripts/
├── utils/
└── README.md

Usage

Sampling

After installation, launch the ODesign sampling process using:

bash ./scripts/run_odesign.sh \
  --infer_model_name odesign_base_prot_flex \
  --data_root_dir ./ODesign/data \
  --ckpt_root_dir ./ODesign/ckpt \
  --input_json_path ./examples/prot_binder/prot_binder.json \
  --exp_name prot_binder \
  --seeds "[42, 123]" \
  --N_sample 20 \
  --invfold_topk 8 \
  --output_dir ./examples/prot_binder/odesign_out \
  --gpus 0

In practical applications, we ran large-scale inference (over 10k designs). Increasing the number of seeds significantly improves the diversity and validity of generated designs. Therefore, we recommend keeping N_sample fixed at 20 and increasing the number of seeds to scale up the total sampling budget. 100 seeds are typically used in our large-scale design campaigns.
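As a rough illustration of the scaling advice above, the sketch below builds a value for --seeds and estimates how many designs a run would produce. The helper names are hypothetical, and the budget formula assumes that each sample is one backbone and that every backbone receives invfold_topk sequences, which the file naming (bb_*/seq_*) suggests but this README does not state explicitly.

```python
import json


def build_seed_arg(num_seeds: int, start: int = 0) -> str:
    """Return a list string for --seeds, e.g. "[0, 1, 2]" (hypothetical helper)."""
    return json.dumps(list(range(start, start + num_seeds)))


def total_designs(num_seeds: int, n_sample: int = 20, invfold_topk: int = 8) -> int:
    """Estimate the number of designed sequences.

    Assumption: one backbone per sample and invfold_topk sequences per backbone.
    """
    return num_seeds * n_sample * invfold_topk


if __name__ == "__main__":
    print(build_seed_arg(3))      # [0, 1, 2]
    print(total_designs(100))     # 16000
```

The printed seed string can be passed directly as the quoted --seeds argument of run_odesign.sh.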

| Argument | Description | Example |
| --- | --- | --- |
| infer_model_name | Model used for inference. | odesign_base_prot_flex |
| data_root_dir | Directory where downloaded data is stored. | ./ODesign/data |
| ckpt_root_dir | Directory where model checkpoints are stored. | ./ODesign/ckpt |
| input_json_path | Path to the input design specification JSON file. | ./examples/prot_binder/prot_binder.json |
| exp_name | Custom label for the inference output directory. | prot_binder |
| seeds | Random seeds used during generation. Supports multiple seeds. | [42] or [42, 123] |
| N_sample | Number of generated samples per seed. | 20 |
| invfold_topk | Number of inverse-folded sequences per backbone. | 8 |
| output_dir | ODesign inference output directory. | ./examples/prot_binder/odesign_out |
| gpus | GPU device(s) for inference. | 0 or 0,1,2,3 |
  • Output Files
pdl1_prot_binder/
├── seed_42/
│   └── predictions/
│       ├── pdl1_prot_binder_seed_42_bb_0_seq_0.cif
│       └── ...
└── seed_123/
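For downstream bookkeeping it can help to index the generated structures by seed, backbone, and sequence. The following is a minimal sketch that relies only on the directory layout and file-name pattern shown above; the function names are hypothetical and not part of this repository.

```python
import re
from pathlib import Path

# Pattern inferred from the example file name
# pdl1_prot_binder_seed_42_bb_0_seq_0.cif shown in the output tree.
NAME_RE = re.compile(
    r"^(?P<exp>.+)_seed_(?P<seed>\d+)_bb_(?P<bb>\d+)_seq_(?P<seq>\d+)\.cif$"
)


def parse_design_name(filename: str):
    """Split a prediction file name into its parts, or return None if it does not match."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    return {
        "exp_name": match["exp"],
        "seed": int(match["seed"]),
        "backbone": int(match["bb"]),
        "sequence": int(match["seq"]),
    }


def list_designs(exp_dir: str):
    """Collect all parsed designs under <exp_dir>/seed_*/predictions/."""
    designs = []
    for cif in sorted(Path(exp_dir).glob("seed_*/predictions/*.cif")):
        info = parse_design_name(cif.name)
        if info is not None:
            info["path"] = str(cif)
            designs.append(info)
    return designs
```

Calling list_designs on the experiment directory returns one dictionary per .cif file, which can then be grouped by backbone or seed as needed.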
  • Inverse Folding (Optional)

If you are more comfortable with ProteinMPNN, here is a workflow that combines ODesign backbones with ProteinMPNN inverse folding.

  1. Prepare LigandMPNN
# clone repo
git clone https://github.com/dauparas/LigandMPNN.git

# build environment
conda create -n ligandmpnn_env python=3.11
conda activate ligandmpnn_env
pip3 install -r ./LigandMPNN/requirements.txt

# get model parameters
bash ./LigandMPNN/get_model_params.sh "./LigandMPNN/model_params"
  2. Run command
# run inverse folding (for ODesign results)
conda activate odesign
export PYTHONPATH="$(pwd):$PYTHONPATH"
python ./utils/prepare_mpnn_input.py \
  --input_path ./examples/prot_binder/odesign_out/pdl1_prot_binder \
  --output_path ./examples/prot_binder/mpnn_redesign_out \
  --redesign_chain_id A

conda activate ligandmpnn_env
python ./scripts/run_mpnn_redesign.py \
  --input_json ./examples/prot_binder/mpnn_redesign_out/mpnn_input.json \
  --output_dir ./examples/prot_binder/mpnn_redesign_out \
  --exp_name prot_binder \
  --binder_chain_id A \
  --batch_size 1 \
  --number_of_batches 1 \
  --temperature 0.1 \
  --omit_AA C \
  --mpnn_model_type protein_mpnn \
  --protein_mpnn_ckpts ./LigandMPNN/model_params/proteinmpnn_v_48_020.pt \
  --gpus 1,2

Filter

We built the filtering module of the ODesign pipeline based on AF3 and PyRosetta, integrating deep-learning and physics-based modeling perspectives. The main workflow consists of AF3 ReFold and scoring using AF3 confidence metrics and PyRosetta scores.

AF3 ReFold

We provide a simple guideline for running AlphaFold3 on large-scale scoring tasks.

  1. Pre-search
  • Run AlphaFold3 on the target alone. The output should look like:
target_seq/
├── seed-1_sample-0/
├── target_confidences.json
├── target_data.json
├── target_model.cif
├── target_summary_confidences.json
├── ranking_scores.csv
└── TERMS_OF_USE.md
  • Extract Target templates and MSAs

We provide a script that extracts MSAs and templates from the *_data.json file in the AF3 output. Use it as follows:

python ./utils/af3_utils.py \
  --root  ./target_seq  \
  --presearch_outdir ./msa_templates

Please set --root to your target_seq/ directory.

Output files:

msa_templates/
├── index.json
├── msa.json
├── target_seq_A_0.cif
├── target_seq_A_1.cif
├── target_seq_A_2.cif
├── target_seq_A_3.cif
├── target_seq_A_paired_msa.a3m
├── target_seq_A_unpaired_msa.a3m
└── template_queries.json

index.json File Format under msa_templates/:

{
  "target_seq": {
    "unpairedMsaPath": "target_seq_A_unpaired_msa.a3m",
    "pairedMsaPath": "target_seq_A_paired_msa.a3m",
    "templates": [
      {
        "mmcifPath": "target_seq_A_0.cif",
        "queryIndices": [0,1,2, ...]
      },
      ...
    ]
  }
}
  2. Inference
  • Prepare the input JSON

    • Because the AF3 and PyRosetta scores depend on chain IDs, set the binder as chain A and the target as chain B.
    • Add unpairedMsaPath, pairedMsaPath, and templates (for the target chain) from index.json into your AF3 input JSON.
    • Keep the binder chain MSA/templates empty (binder is designed de novo).
    • Mount the msa_templates/ absolute path into the Docker container.
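The points above can be sketched as a small helper that assembles one AF3 input JSON per design. This is an illustration rather than the repository's own tooling: build_af3_input is a hypothetical name, the JSON fields follow the AlphaFold3 input format (assumed to be version 2 or later, which accepts unpairedMsaPath, pairedMsaPath, and mmcifPath), and the empty-string MSA fields on the binder are the AF3 convention for running a chain without an MSA.

```python
import json
from pathlib import Path


def build_af3_input(name, binder_seq, target_seq, index_entry, msa_dir, seed=1):
    """Assemble an AF3 input dict with the binder as chain A and the target as chain B.

    index_entry is one value from msa_templates/index.json, and msa_dir is the
    absolute path of msa_templates/ as seen inside the Docker container.
    """
    msa_dir = Path(msa_dir)
    # Copy each template entry unchanged except for making its path absolute.
    templates = [
        {**tpl, "mmcifPath": str(msa_dir / tpl["mmcifPath"])}
        for tpl in index_entry.get("templates", [])
    ]
    binder = {
        "protein": {
            "id": "A",
            "sequence": binder_seq,
            # De novo binder: no MSA and no templates.
            "unpairedMsa": "",
            "pairedMsa": "",
            "templates": [],
        }
    }
    target = {
        "protein": {
            "id": "B",
            "sequence": target_seq,
            "unpairedMsaPath": str(msa_dir / index_entry["unpairedMsaPath"]),
            "pairedMsaPath": str(msa_dir / index_entry["pairedMsaPath"]),
            "templates": templates,
        }
    }
    return {
        "name": name,
        "modelSeeds": [seed],
        "sequences": [binder, target],
        "dialect": "alphafold3",
        "version": 2,
    }


def write_af3_input(af3_input, out_path):
    """Write the assembled input to disk as JSON."""
    Path(out_path).write_text(json.dumps(af3_input, indent=2))
```

In practice the index entry would be loaded with json.load from msa_templates/index.json under the key of your target (target_seq in the example above).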
  • Run Large-Scale AlphaFold3 Inference

Once you have pre-searched the MSA and templates and added them to the input JSON file, you can set the run_alphafold.py parameters as follows:

docker run -it \
    --volume $HOME/af_input:/root/af_input \
    --volume $HOME/af_output:/root/af_output \
    --volume <MODEL_PARAMETERS_DIR>:/root/models \
    --volume <DATABASES_DIR>:/root/public_databases \
    --volume <msa_templates_dir>:<msa_templates_dir>  \
    --gpus all \
    alphafold3 \
    python run_alphafold.py \
    --json_path=/root/af_input/fold_input.json \
    --model_dir=/root/models \
    --output_dir=/root/af_output  \
    --run_data_pipeline=false \
    --num_diffusion_samples=1

This command is adapted from the official AlphaFold3 repository.

Please organize your AF3 outputs using the following layout:

af3_out/
├── prot_binder_seq0/
│   ├── seed-1_sample-0/
│   ├── prot_binder_seq0_confidences.json
│   ├── prot_binder_seq0_data.json
│   ├── prot_binder_seq0_model.cif
│   ├── prot_binder_seq0_summary_confidences.json
│   ├── ranking_scores.csv
│   └── TERMS_OF_USE.md
├── prot_binder_seq1/
└── .../
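Before launching the filter, it can be useful to check that every design directory follows this layout. The sketch below walks af3_out/ and reads each *_summary_confidences.json. The function name is hypothetical, and the keys iptm, ptm, and chain_ptm are taken from the AlphaFold3 output format; how the pipeline derives its own metrics (for example binder_ptm or ipae_min) from these files is not shown here.

```python
import json
from pathlib import Path


def load_af3_summaries(af3_out):
    """Read <name>/<name>_summary_confidences.json for every design directory.

    Directories without a summary file are skipped, so the result also
    shows which designs are ready for filtering.
    """
    rows = []
    for design_dir in sorted(Path(af3_out).iterdir()):
        if not design_dir.is_dir():
            continue
        summary = design_dir / f"{design_dir.name}_summary_confidences.json"
        if not summary.is_file():
            continue
        data = json.loads(summary.read_text())
        rows.append(
            {
                "sample_id": design_dir.name,
                "iptm": data.get("iptm"),
                "ptm": data.get("ptm"),
                "chain_ptm": data.get("chain_ptm"),
                "summary_json": str(summary),
            }
        )
    return rows
```

Comparing the number of returned rows with the number of submitted designs gives a quick indication of failed or incomplete AF3 runs.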

Score

  • Score Metrics

For mini-protein design, we use the following score metrics:

| Score Model | Metric | Threshold (Pass Condition) |
| --- | --- | --- |
| AlphaFold3 | binder_ptm | >= 0.8 |
| AlphaFold3 | ipae_min | <= 1.5 |
| PyRosetta | ddg | <= -44 |
| PyRosetta | sap_score | <= 40 |
| PyRosetta | contact_molecular_surface | >= 400 |

Note: These thresholds are the example settings used in the ODesign mini-protein campaign and should be tuned per target and protocol.
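To make the pass conditions concrete, here is a minimal sketch that applies the example thresholds to one design's metrics. It is an illustration only: the THRESHOLDS dictionary and failed_metrics function are hypothetical and do not reflect the actual schema of miniprotein_filter.json, which is not shown in this README.

```python
import operator

# Example thresholds copied from the table above; tune per target and protocol.
THRESHOLDS = {
    "binder_ptm": (operator.ge, 0.8),
    "ipae_min": (operator.le, 1.5),
    "ddg": (operator.le, -44),
    "sap_score": (operator.le, 40),
    "contact_molecular_surface": (operator.ge, 400),
}


def failed_metrics(metrics):
    """Return the names of metrics that are missing or violate their pass condition."""
    failed = []
    for name, (compare, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None or not compare(value, limit):
            failed.append(name)
    return failed


if __name__ == "__main__":
    design = {
        "binder_ptm": 0.85,
        "ipae_min": 1.2,
        "ddg": -30.0,
        "sap_score": 35.0,
        "contact_molecular_surface": 450.0,
    }
    print(failed_metrics(design))  # ['ddg']
```

A design passes when the returned list is empty; otherwise the list plays the same role as the failure reasons reported by the filter.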

  • Launch Command
cd ODesign-pipeline
conda activate odesign
bash ./scripts/run_filter.sh \
  --exp_name pdl1_prot_binder \
  --af3_out ./examples/prot_binder/af3_out \
  --filter_outdir ./examples/prot_binder/filter_out \
  --filter_json ./filter/score_json/miniprotein_filter.json \
  --binder_chain A  \
  --target_chain B  \
  --rosetta_xml ./filter/rosetta_cmds/ppi.xml

Please change --af3_out to your own AF3 output directory.

| Argument | Description | Example |
| --- | --- | --- |
| exp_name | Experiment name tag used to label outputs. | pdl1_prot_binder |
| af3_out | AF3 refold results directory to be filtered. | ./examples/prot_binder/af3_out |
| filter_outdir | Output directory for filter results. | ./examples/prot_binder/filter_out |
| filter_json | Path to the filtering config JSON (score index, thresholds). | ./filter/score_json/miniprotein_filter.json |
| binder_chain | Chain ID for the binder in the complex. | A |
| target_chain | Chain ID for the target in the complex. | B |
| rosetta_xml | RosettaScripts XML used for the scoring protocol. | ./filter/rosetta_cmds/ppi.xml |
  • Understand the Outputs

File Architecture

filter_outdir/
├── af3_filter/
│   ├── exp_name_af3_filtered/
│   ├── exp_name_af3_pass.csv
│   └── exp_name_af3_fail.csv
├── rosetta_filter/
│   ├── exp_name_rosetta_filtered/
│   ├── exp_name_rosetta_pass.csv
│   └── exp_name_rosetta_fail.csv
└── exp_name_filter_final.csv

exp_name_filter_final.csv: Final CSV merging AF3 and Rosetta results.
exp_name_rosetta_filtered/: Structure files (PDB) for designs that passed both the AF3 and Rosetta filters.
*_fail.csv: Failed designs and the corresponding failure reasons.
*_pass.csv: Designs that passed the filter.
*_filtered/: Structure files for designs that passed the filter.

Info - exp_name_filter_final.csv

| Type | Columns |
| --- | --- |
| Basic Info | sample_id, binder_seq, target_seq, pdb_dir, summary_json |
| AF3 Confidences | iptm, binder_ptm, complex_ptm, chain_ptm_avg, ipae_min, ipae_avg |
| Rosetta Scores | ddg, sap_score, contact_molecular_surface |

Designs are automatically ranked by ddg in exp_name_filter_final.csv, following the ODesign protein binder protocol.
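To pick candidates from the final CSV, a short sketch such as the following can be used. It relies only on the column names listed above; top_designs is a hypothetical helper, and it assumes that lower (more negative) ddg is better, so it re-sorts in ascending order rather than depending on the file's existing order.

```python
import csv


def top_designs(csv_path, n=10):
    """Return the n rows with the lowest ddg from exp_name_filter_final.csv."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    # Assumption: more negative ddg indicates a more favorable interface.
    rows.sort(key=lambda row: float(row["ddg"]))
    return rows[:n]
```

Each returned row is a dictionary keyed by column name, so fields such as sample_id, binder_seq, and pdb_dir are available for ordering constructs or copying structure files.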

License

  • ODesign-pipeline (this repository) source code is released under the Apache License 2.0 (see LICENSE).
  • This repository does NOT distribute any third-party model weights, databases, or proprietary software (including but not limited to AlphaFold3 parameters and PyRosetta distributions).
  • To run the pipeline, users must obtain and install required third-party components from their official sources and comply with their respective licenses/terms.
