VirulentHunter: Deep Learning-Based Virulence Factor Predictor Illuminates Pathogenicity in Diverse Microbial Niches

1. Database construction

In this study, we focused exclusively on bacterial virulence factors (VFs). We gathered all bacterial VF protein sequences from three public databases: VFDB 2022[1], Victors[2] and BV-BRC[3]. These sequences were clustered with CD-HIT[4] v4.8.1 at 100% sequence identity and 80% coverage to remove duplicates, yielding 30,483 non-redundant VFs. Because many of the collected VFs lacked category information, we implemented a label propagation strategy to annotate them with the 14 primary categories defined by VFDB. First, we clustered the sequences using DIAMOND with a sequence identity threshold of 80% and a coverage threshold of 80%. Within each cluster, every member VF was assigned the union of the members' existing annotations. To extend category assignment beyond sequence similarity, we then used TM-Vec[5], a deep learning tool for structural similarity detection. VFs were clustered using TM-Vec with a threshold of 0.9 and, following the same label propagation strategy, each VF within a cluster was assigned the union of that cluster's existing annotations.
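The union-based label propagation described above can be illustrated with a small sketch. This is not the code used to build the database: the function name `propagate_labels`, the input shapes (clusters as lists of VF identifiers, labels as sets of category names) and the toy identifiers are all assumptions made for illustration, and clusters are assumed not to overlap.

```python
from typing import Dict, Iterable, Set


def propagate_labels(
    clusters: Iterable[Iterable[str]],
    labels: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Give every member of a cluster the union of the members' labels.

    Hypothetical helper: `clusters` is an iterable of member-ID collections,
    `labels` maps a VF ID to its (possibly empty) set of categories.
    """
    # Start from a copy so VFs outside every cluster keep their labels.
    propagated = {vf: set(cats) for vf, cats in labels.items()}
    for members in clusters:
        members = list(members)
        union: Set[str] = set()
        for vf in members:
            union |= propagated.get(vf, set())
        for vf in members:
            propagated[vf] = set(union)
    return propagated


# Toy data (illustrative only): vf2 and vf4 start without any category.
labels = {"vf1": {"Adherence"}, "vf2": set(), "vf3": {"Exotoxin"}, "vf4": set()}
sequence_clusters = [["vf1", "vf2"], ["vf3"], ["vf4"]]   # e.g. from DIAMOND
structure_clusters = [["vf2", "vf4"], ["vf1"], ["vf3"]]  # e.g. from TM-Vec

# Stage 1: sequence-based clusters; stage 2: structure-based clusters.
after_sequence = propagate_labels(sequence_clusters, labels)
after_structure = propagate_labels(structure_clusters, after_sequence)
print(after_structure)
```

In this toy run, vf2 inherits its category from vf1 through the sequence cluster, and vf4 then inherits it from vf2 through the structure cluster, which is why the two stages are applied in sequence.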

2. Model architecture

We present VirulentHunter, a novel deep learning framework for simultaneous VF identification and classification directly from protein sequences. We constructed a comprehensive, curated VF database by integrating diverse public resources and rigorously expanding VF category annotations. Benchmarking demonstrates that VirulentHunter significantly outperforms existing methods, particularly for VFs lacking detectable homology.

3. Dependencies

python 3.8.13
pytorch 2.4.1
transformers 4.44.2
biopython 1.83
peft 0.7.1
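
As a convenience, the pinned versions above can be compared against the active environment with a short check. This is a sketch rather than part of the repository: the helper name `check_versions` is hypothetical, and it assumes the packages were installed under their PyPI distribution names (PyTorch is distributed as `torch`).

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Versions listed in this README; "torch" is the PyPI name of PyTorch.
PINNED = {
    "torch": "2.4.1",
    "transformers": "4.44.2",
    "biopython": "1.83",
    "peft": "0.7.1",
}


def check_versions(
    pinned: Dict[str, str] = PINNED,
) -> Dict[str, Tuple[str, Optional[str], bool]]:
    """Return {package: (wanted, installed or None, matches)}."""
    report = {}
    for package, wanted in pinned.items():
        try:
            found = metadata.version(package)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        report[package] = (wanted, found, found == wanted)
    return report


for package, (wanted, found, ok) in check_versions().items():
    status = "ok" if ok else "MISMATCH"
    print(f"{package}: wanted {wanted}, found {found} [{status}]")
```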

4. Example usage

To use the VirulentHunter code, first download the 'esm2_t30_150M_UR50D' model and place it in the 'models/' folder, then run the following command:

python predict.py -i data/test.fasta -o results/predict_results.txt
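
One possible way to fetch the ESM-2 checkpoint is through the `transformers` library listed in the dependencies. The sketch below makes two assumptions that this README does not confirm: that the scripts load the model in Hugging Face format, and that the checkpoint is taken from the `facebook/esm2_t30_150M_UR50D` repository on the Hugging Face Hub. The helper names are hypothetical.

```python
from pathlib import Path

# Assumed source of the checkpoint (not stated in this README).
HF_REPO_ID = "facebook/esm2_t30_150M_UR50D"


def local_model_dir(model_name: str = "esm2_t30_150M_UR50D",
                    root: str = "models") -> Path:
    """Path where the README expects the model, e.g. models/esm2_t30_150M_UR50D."""
    return Path(root) / model_name


def download_esm2(repo_id: str = HF_REPO_ID, root: str = "models") -> Path:
    """Download tokenizer and weights, then save them under `root`.

    Requires network access and the `transformers` package.
    """
    # Imported here so the path helper works without transformers installed.
    from transformers import AutoTokenizer, EsmModel

    target = local_model_dir(repo_id.split("/")[-1], root)
    target.mkdir(parents=True, exist_ok=True)
    AutoTokenizer.from_pretrained(repo_id).save_pretrained(target)
    EsmModel.from_pretrained(repo_id).save_pretrained(target)
    return target
```

Calling `download_esm2()` once from the repository root would populate `models/esm2_t30_150M_UR50D`, the path passed as `--esm_path` in the training commands.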

5. Training and analyzing custom data

Binary task:

python main.py --esm_path models/esm2_t30_150M_UR50D --input_fasta_path data/binary/ --input_label_path data/binary/ --max_len 2000

VF category task:

python main_multi_label.py --esm_path models/esm2_t30_150M_UR50D --input_fasta_path data/multi-label/train.fasta --input_label_path data/multi-label/train_labels.pkl --max_len 2000
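
Both commands pass `--max_len 2000`. How the scripts treat longer sequences (truncation or exclusion) is not stated here, so it can be useful to check a custom FASTA file beforehand. The following is a minimal, dependency-free sketch; `read_fasta` and `too_long` are hypothetical helpers and not part of the repository.

```python
import io
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            # Keep only the identifier, dropping any free-text description.
            header, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def too_long(handle: TextIO, max_len: int = 2000) -> Dict[str, int]:
    """Map identifier -> length for sequences longer than `max_len`."""
    return {name: len(seq) for name, seq in read_fasta(handle) if len(seq) > max_len}


# Toy example with a small limit; for real data, pass open("train.fasta").
demo = io.StringIO(">a some description\nMKT\nAAA\n>b\nMK\n")
print(too_long(demo, max_len=3))
```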

6. Web server

We are pleased to announce the launch of our web service, VirulentHunter, designed for the identification and prediction of virulence factors (VFs). The platform supports the input of protein sequences, strain genomes, and metagenomic data. VirulentHunter analyzes the input data and provides detailed outputs, including the identified virulence factors (if the query is classified as a VF) and their corresponding prediction probabilities. Visit the website at http://www.unimd.org/VirulentHunter to explore its capabilities. For users with large-scale data prediction needs, we recommend deploying a local version of the service to ensure efficient and customized processing.

References

[1] B. Liu, D. Zheng, S. Zhou, L. Chen, and J. Yang, “VFDB 2022: a general classification scheme for bacterial virulence factors,” Nucleic Acids Research, vol. 50, no. D1, pp. D912–D917, Jan. 2022, doi: 10.1093/nar/gkab1107.

[2] S. Sayers et al., “Victors: a web-based knowledge base of virulence factors in human and animal pathogens,” Nucleic Acids Research, vol. 47, no. D1, pp. D693–D700, Jan. 2019, doi: 10.1093/nar/gky999.

[3] R. D. Olson et al., “Introducing the Bacterial and Viral Bioinformatics Resource Center (BV-BRC): a resource combining PATRIC, IRD and ViPR,” Nucleic Acids Research, vol. 51, no. D1, pp. D678–D689, Jan. 2023, doi: 10.1093/nar/gkac1003.

[4] L. Fu, B. Niu, Z. Zhu, S. Wu, and W. Li, “CD-HIT: accelerated for clustering the next-generation sequencing data,” Bioinformatics, vol. 28, no. 23, pp. 3150–3152, Dec. 2012, doi: 10.1093/bioinformatics/bts565.

[5] T. Hamamsy et al., “Protein remote homology detection and structural alignment using deep learning,” Nat Biotechnol, pp. 1–11, Sep. 2023, doi: 10.1038/s41587-023-01917-2.
