Large Language Models (LLMs) have demonstrated exceptional utility but remain vulnerable to adversarial jailbreak attacks, in which carefully crafted prompts bypass safety mechanisms and cause LLMs to generate harmful content. Existing defense mechanisms, including supervised fine-tuning, preference optimization, and LLM-based detectors placed in front of the target model, struggle against advanced jailbreaks that rely on iterative adversarial optimization: these attacks produce adversarial patterns that distract the target LLM, leading to high attack success rates.
This research highlights a key observation: current text-based jailbreaks exhibit limited generalization when transferred to other modalities, such as text-embedded images. Based on this, we propose a novel jailbreak detection method named Vision-Language-Model for Jailbreak Defense (VLM4JD). VLM4JD encodes textual prompts into visual signals and leverages the multimodal understanding capabilities of Vision-Language Models (VLMs) to detect harmful intent.
VLM4JD is lightweight, training-free, and highly effective. Our empirical evaluations on large jailbreak datasets (JailbreakBench, HarmBench) demonstrate that VLM4JD significantly reduces Attack Success Rates (ASR), outperforming traditional text-based detectors. This work uncovers the cross-modal generalization limitations of current jailbreak attacks and offers VLM4JD as a robust defense, aiming to enhance the understanding and mitigation of such vulnerabilities.
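For a concrete picture of the idea, the sketch below renders a textual prompt into an image with Pillow and asks a VLM whether the depicted request is harmful. It is a minimal, illustrative sketch rather than the implementation in `defences.py`; `prompt_to_image`, `query_vlm`, and the rendering parameters are placeholder names and choices, and the actual VLM prompt used by VLM4JD may differ.

```python
import textwrap
from PIL import Image, ImageDraw


def prompt_to_image(prompt: str, width: int = 768, margin: int = 20) -> Image.Image:
    """Render a textual prompt onto a white canvas (simplified sketch)."""
    lines = textwrap.wrap(prompt, width=80)
    height = margin * 2 + 18 * max(len(lines), 1)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        # Uses PIL's default bitmap font; a real implementation would pick a readable TTF.
        draw.text((margin, margin + 18 * i), line, fill="black")
    return img


def is_harmful(prompt: str, query_vlm) -> bool:
    """Ask a VLM whether the request shown in the rendered image is harmful.

    `query_vlm(image, question) -> str` is a placeholder for whichever VLM backend
    is used (e.g., Phi-3.5-vision served through transformers or vllm).
    """
    image = prompt_to_image(prompt)
    verdict = query_vlm(
        image,
        "Does the text in this image request harmful or unsafe content? Answer yes or no.",
    )
    return verdict.strip().lower().startswith("yes")
```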
- `main.py`: The main script to run jailbreak detection experiments using VLM4JD and other baseline defenses.
- `utils.py`: Contains utility functions for processing prompts, interacting with models, and evaluating outputs (e.g., judging harmfulness).
- `defences.py`: Implements the VLM4JD defense mechanism along with other comparative defense strategies (e.g., SmoothLLM, Guard Model, Self-Eval).
- `requirements.txt`: (Recommended) A file listing all Python dependencies.
- `LICENSE`: The license for this project.
- GPU: All experiments were run on a node with 8 × NVIDIA A100 (80 GB)
- CPU: AMD EPYC 7K62
- Operating System: Ubuntu 22.04
- Python: Python 3.10 or newer
- Key Python Libraries:
  - torch
  - transformers
  - vllm
  - jailbreakbench
  - scikit-learn
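Once the installation steps below are done, you can quickly confirm that the GPU and key libraries are visible with a short check like the following (illustrative; not part of the repository):

```python
# Quick environment sanity check (illustrative; not part of the repository).
import torch
import transformers
import sklearn  # provided by the scikit-learn package

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
print("transformers version:", transformers.__version__)
```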
- Clone the repository:

  ```bash
  git clone https://github.com/xiyuanyang45/VLM-for-jailbreak-defense VLM4JD
  cd VLM4JD
  ```

- Set up a Python virtual environment (recommended):

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```

- Install Python dependencies. It is recommended to create a `requirements.txt` file with all necessary packages. If you have one:

  ```bash
  pip install -r requirements.txt
  ```

  Alternatively, install the core packages manually:

  ```bash
  pip install torch transformers vllm jailbreakbench scikit-learn Pillow
  ```

- Hugging Face Hub: If you plan to use models that require authentication (e.g., some Llama models), log in using the Hugging Face CLI:

  ```bash
  huggingface-cli login
  ```
- Datasets: This project uses datasets such as JailbreakBench and HarmBench. The `jailbreakbench` library typically handles the download and management of its standard benchmarks; the `get_all_prompts` function in `utils.py` (or similar) should interface with `jailbreakbench` to load adversarial prompts (see the loading sketch after this list). For the `xstest` dataset used in the over-rejection tests (`get_overrej_test_prompts`), make sure the dataset is available at the expected path (download and place it manually if needed).
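As a reference, the snippet below sketches how adversarial prompts can be pulled from a JailbreakBench attack artifact. It assumes the `jailbreakbench` artifact API (`read_artifact` and the `jailbreaks` field); field names may vary across library versions, and this is not necessarily how `get_all_prompts` in `utils.py` loads prompts.

```python
# Sketch: loading adversarial prompts from a JailbreakBench attack artifact.
# Field names may differ across jailbreakbench versions; consult its documentation.
import jailbreakbench as jbb

artifact = jbb.read_artifact(method="GCG", model_name="vicuna-13b-v1.5")
adv_prompts = [entry.prompt for entry in artifact.jailbreaks]
print(f"Loaded {len(adv_prompts)} adversarial prompts")
```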
The main script, `main.py`, is used to run experiments. Below are some examples:
To run the VLM4JD defense against GCG attacks on Vicuna-13b-v1.5, using Phi-3.5-vision as the VLM detector on the JailbreakBench dataset:

```bash
python main.py \
    --method GCG \
    --model_been_attack vicuna-13b-v1.5 \
    --vlm microsoft/Phi-3.5-vision-instruct \
    --defences vlm \
    --dataset jailbreakbench
```

You can specify multiple VLMs for an ensemble defense by comma-separating them in the `--vlm` argument:

```bash
python main.py \
    --method GCG \
    --model_been_attack vicuna-13b-v1.5 \
    --vlm "microsoft/Phi-3.5-vision-instruct,Qwen/Qwen2-VL-2B-Instruct" \
    --defences vlm \
    --dataset jailbreakbench
```

To evaluate the attack success rate without any defense:

```bash
python main.py \
    --method GCG \
    --model_been_attack vicuna-13b-v1.5 \
    --wo_defence 1 \
    --dataset jailbreakbench
```

To test the SmoothLLM defense (variant 1):

```bash
python main.py \
    --method GCG \
    --model_been_attack vicuna-13b-v1.5 \
    --defences smoothllm \
    --smooth_var 1 \
    --dataset jailbreakbench
```

To test Llama Guard 3 (1B) as a defense:

```bash
python main.py \
    --method GCG \
    --model_been_attack vicuna-13b-v1.5 \
    --defences guard_model \
    --guard_model meta-llama/Llama-Guard-3-1B \
    --dataset jailbreakbench
```

To test the over-rejection rate of VLM4JD on the xstest dataset (xstest contains benign prompts that superficially resemble unsafe requests):

```bash
python main.py \
    --test_over_rej 1 \
    --over_rej_dataset xstest \
    --defences vlm \
    --vlm microsoft/Phi-3.5-vision-instruct
```

The main command-line arguments:

- `--method`: Adversarial attack method (e.g., `GCG`).
- `--model_been_attack`: The target LLM being attacked (e.g., `vicuna-13b-v1.5`, `llama-2-7b-chat-hf`).
- `--vlm`: The VLM(s) used for the VLM4JD defense. Can be a single model or a comma-separated list for an ensemble. Shorthand names (e.g., `phi3.5`) or full Hugging Face paths are supported.
- `--defences`: Comma-separated list of defense methods to apply (e.g., `vlm`, `smoothllm`, `guard_model`, `self_eval`, `llm_eval`, `vlm_text`).
- `--wo_defence`: Set to `1` to run without any defense.
- `--test_over_rej`: Set to `1` to test over-rejection on benign prompts.
- `--over_rej_dataset`: Dataset for over-rejection testing (e.g., `xstest`).
- `--dataset`: Primary jailbreak dataset (e.g., `jailbreakbench`, `harmbench`).
- `--smooth_var`: Variant of the SmoothLLM defense.
- `--guard_model`: Path to the guard model for the `guard_model` defense.
- `--llm_eval`: Path to the LLM used for the `llm_eval` defense.
- `--vlm_text`: Path to the VLM used for the `vlm_text` defense (the VLM queried with text only, no image).
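For interpreting the outputs, the two reported quantities reduce to simple ratios. The snippet below is an illustrative computation, not the repository's evaluation code; it assumes you already have per-prompt harmfulness judgments for attacked responses and per-prompt refusal decisions on benign prompts.

```python
def attack_success_rate(judged_harmful: list[bool]) -> float:
    """Fraction of adversarial prompts whose final response was judged harmful (ASR)."""
    return sum(judged_harmful) / len(judged_harmful)


def over_rejection_rate(benign_rejected: list[bool]) -> float:
    """Fraction of benign prompts (e.g., from xstest) that the defense wrongly refused."""
    return sum(benign_rejected) / len(benign_rejected)


# A defense is better when ASR drops without the over-rejection rate rising.
print(attack_success_rate([True, False, False, False]))   # 0.25
print(over_rejection_rate([False, False, True, False]))   # 0.25
```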
Coming soon.
This project is licensed under the terms described in the LICENSE file.