A modular pipeline for detecting co-evolving residues between interacting protein pairs, inspired by Green et al., 2021, Nature Communications.
This pipeline enables the detection and visualization of co-evolving positions between two protein sequences, leveraging large-scale homology search, multiple sequence alignment (MSA), and statistical coupling analysis. It is designed for flexibility, reproducibility, and ease of use.
- Automated BLAST homology search with assembly-level filtering for taxonomic diversity
- Pairing of homologs from the same genome assembly
- Concatenated MSA construction and quality filtering
- Coevolution analysis using PLMC
- Publication-quality heatmaps and network visualizations
- Modular, script-driven workflow
- Clone the repository:

      git clone <repo-url>
      cd protein-interaction-coevolution

- Run setup:

      chmod +x setup.sh
      ./setup.sh

  This will install all required Python dependencies automatically.
- External tools:
  - MAFFT and PLMC are installed automatically by the setup script via conda.
  - If you need to install them manually:
    - MAFFT: `conda install -c conda-forge mafft`
    - PLMC: `conda install -c bioconda plmc`
- Prepare your input:
  - Place your two query FASTA files as `queryA.fasta` and `queryB.fasta` in `inputs/find_homologues/PROJECT/`.
  - Set the `PROJECT` variable in the Makefile (currently set to `ABO`).
- Run the pipeline:

      make find         # Find and pair homologs
      make msa          # Build and filter concatenated MSA
      make coevolution  # Run PLMC coevolution analysis
      make heatmap      # Generate heatmaps and network visualizations

- `make find`
  - Description: Runs `scripts/blast_and_pair.py` to perform BLASTP searches for both query proteins against NCBI nr, filter hits by assembly, and pair homologs from the same genome.
  - Input:
    - `inputs/find_homologues/PROJECT/queryA.fasta`
    - `inputs/find_homologues/PROJECT/queryB.fasta`
  - Output:
    - `results/find_homologues/PROJECT/paired_sequences.json`
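
The pairing idea in this step can be illustrated with a minimal sketch. It assumes each query has at most one retained hit per genome assembly and that hits are keyed by assembly accession; the function name, the data layout, and the accessions are hypothetical and may not match what `scripts/blast_and_pair.py` actually does.

```python
from typing import Dict, List, Tuple


def pair_by_assembly(
    hits_a: Dict[str, str], hits_b: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (assembly, seq_a, seq_b) for assemblies with a hit for both queries.

    Assumption: each dict maps an assembly accession to one homolog sequence.
    """
    shared = sorted(set(hits_a) & set(hits_b))
    return [(asm, hits_a[asm], hits_b[asm]) for asm in shared]


# Made-up accessions and toy sequences, for illustration only.
hits_a = {"GCF_000001": "MKT", "GCF_000002": "MRT", "GCF_000003": "MKS"}
hits_b = {"GCF_000002": "LLV", "GCF_000003": "LIV", "GCF_000004": "LLA"}

pairs = pair_by_assembly(hits_a, hits_b)
for assembly, seq_a, seq_b in pairs:
    print(assembly, seq_a, seq_b)
```

Assemblies with a hit for only one of the two queries are dropped, so every retained pair comes from a single genome.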
- `make msa`
  - Description: Runs `scripts/build_msas.sh` to construct a concatenated MSA of paired homologs using MAFFT, with post-processing to remove low-quality columns.
  - Input:
    - `results/find_homologues/PROJECT/paired_sequences.json`
  - Output:
    - `results/msas/PROJECT/proteinAB.aln`
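
One common way to remove low-quality columns is to drop alignment columns that are mostly gaps. The sketch below shows that idea only; the 0.5 gap threshold, the function name, and the in-memory dict representation are assumptions, and the filtering criteria used by `scripts/build_msas.sh` may differ.

```python
from typing import Dict


def drop_gappy_columns(
    alignment: Dict[str, str], max_gap_fraction: float = 0.5
) -> Dict[str, str]:
    """Remove columns whose gap fraction exceeds max_gap_fraction (hypothetical default)."""
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if not rows:
        return {}
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    # Indices of columns that pass the gap filter.
    keep = [
        col
        for col in range(length)
        if sum(row[col] == "-" for row in rows) / len(rows) <= max_gap_fraction
    ]
    return {name: "".join(alignment[name][col] for col in keep) for name in names}


# Toy concatenated alignment; the third column is a gap in 2 of 3 rows.
aln = {"pair1": "MK-TA", "pair2": "MR-TA", "pair3": "MKLT-"}
filtered = drop_gappy_columns(aln, max_gap_fraction=0.5)
print(filtered)
```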
- `make coevolution`
  - Description: Runs `scripts/coevolution_analysis.py` to compute coevolutionary couplings using PLMC, outputting a CSV of residue-residue coupling strengths.
  - Input:
    - `results/msas/PROJECT/proteinAB.aln`
  - Output:
    - `results/coevolution/PROJECT/coevolution_results.csv`
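
As a rough illustration of how a coupling CSV could be consumed downstream, the sketch below keeps only inter-protein pairs and ranks them by score. The column names `i`, `j`, and `score`, the 1-based positions in the concatenated alignment, and the `len_a` parameter (length of protein A in that alignment) are all assumptions; check the header of `coevolution_results.csv` before adapting it.

```python
import csv
import io
from typing import List, Tuple


def top_inter_protein_pairs(
    csv_text: str, len_a: int, top_n: int = 10
) -> List[Tuple[int, int, float]]:
    """Return the top_n (pos_in_A, pos_in_B, score) tuples, strongest first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    pairs = []
    for row in reader:
        i, j, score = int(row["i"]), int(row["j"]), float(row["score"])
        lo, hi = min(i, j), max(i, j)
        # Inter-protein: one position in protein A, the other in protein B.
        if lo <= len_a < hi:
            pairs.append((lo, hi - len_a, score))
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs[:top_n]


# Toy data with hypothetical column names; protein A spans positions 1-3.
example_csv = """i,j,score
1,2,0.90
2,5,0.75
3,4,0.40
1,6,0.10
"""

top = top_inter_protein_pairs(example_csv, len_a=3, top_n=2)
print(top)
```

The strongest row in the toy data (positions 1 and 2) is excluded because both positions fall within protein A.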
- `make heatmap`
  - Description: Runs `scripts/generate_heatmaps.py` to create heatmaps and network diagrams of the strongest coevolving residue pairs.
  - Input:
    - All previous outputs
  - Output:
    - `results/heatmaps/PROJECT/` (PNG images, CSV mappings)
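
A heatmap of inter-protein couplings is typically drawn from a matrix with one row per residue of protein A and one column per residue of protein B. The sketch below builds such a matrix from a list of `(pos_in_A, pos_in_B, score)` tuples; the tuple layout, 1-based positions, and function name are assumptions rather than the format `scripts/generate_heatmaps.py` uses.

```python
from typing import List, Sequence, Tuple


def coupling_matrix(
    pairs: Sequence[Tuple[int, int, float]], len_a: int, len_b: int
) -> List[List[float]]:
    """Build a len_a x len_b matrix of coupling scores (0.0 where no pair is given)."""
    matrix = [[0.0] * len_b for _ in range(len_a)]
    for pos_a, pos_b, score in pairs:
        # Positions are assumed to be 1-based.
        matrix[pos_a - 1][pos_b - 1] = score
    return matrix


# Toy couplings between a 3-residue protein A and a 3-residue protein B.
pairs = [(2, 2, 0.75), (3, 1, 0.40)]
matrix = coupling_matrix(pairs, len_a=3, len_b=3)
for row in matrix:
    print(row)
```

A matrix in this form can be passed to a plotting call such as `matplotlib.pyplot.imshow` to render the heatmap.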
- Paired homologs: JSON file with all paired sequences
- MSA: Concatenated and filtered alignment in FASTA format
- Coevolution results: CSV of coupling scores
- Visualizations:
  - Global heatmaps at multiple thresholds
  - Network diagrams of inter-protein couplings
  - Mapped coupling CSVs for downstream analysis
- Python 3.7+
- Biopython
- pandas
- numpy
- matplotlib
- scipy
- pyyaml
- tqdm
- pysam
- MAFFT (external, for MSA)
- PLMC (external, for coevolution analysis)
This pipeline is inspired by:
Green, A.G., Elhabashy, H., Brock, K.P. et al. Large-scale discovery of protein interactions at residue resolution using co-evolution calculated from genomic sequences. Nat Commun 12, 1396 (2021). https://doi.org/10.1038/s41467-021-21636-z
If you use this pipeline, please cite the above work and this repository.
Distributed under the MIT License. See LICENSE for details.